InstantID: Zero-shot Identity-Preserving Generation in Seconds
Pith reviewed 2026-05-17 20:57 UTC · model grok-4.3
The pith
InstantID generates high-fidelity personalized images from one face photo in seconds without fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that InstantID achieves zero-shot identity-preserving generation through a novel IdentityNet that imposes strong semantic conditions from a single facial image and weak spatial conditions from landmark images, integrated with textual prompts to guide the diffusion process. This design delivers high face fidelity across styles, requires no fine-tuning, operates via a single forward inference, and serves as a compatible plugin for pre-trained models such as SD1.5 and SDXL.
What carries the argument
IdentityNet, which imposes strong semantic conditions from a facial image and weak spatial conditions from landmarks while integrating with textual prompts to steer diffusion-based generation.
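As a rough illustration of what "strong semantic, weak spatial" conditioning could look like, the toy sketch below fuses three signals for a single query vector: attention over text tokens, attention over identity tokens scaled by id_scale, and a landmark-derived residual scaled by a small spatial_scale. The function names, the additive fusion rule, and both scale parameters are assumptions made for this illustration only. It is a sketch under those assumptions, not the paper's IdentityNet implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tokens):
    """Single-query scaled dot-product attention; tokens act as keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, tok)) for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


def conditioned_feature(query, text_tokens, id_tokens, landmark_residual,
                        id_scale=1.0, spatial_scale=0.2):
    """Hypothetical fusion rule (an assumption, not taken from the paper):
    text attention + id_scale * identity attention + spatial_scale * landmark residual.
    A large id_scale stands in for the 'strong semantic' condition and a small
    spatial_scale for the 'weak spatial' condition."""
    text_out = attend(query, text_tokens)
    id_out = attend(query, id_tokens)
    return [t + id_scale * i + spatial_scale * r
            for t, i, r in zip(text_out, id_out, landmark_residual)]
```

Setting id_scale and spatial_scale to zero recovers plain text-conditioned attention, which makes the relative weight of each condition easy to probe in isolation.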
If this is right
- Personalized image synthesis works using only one reference facial image.
- High face fidelity holds when generating outputs in different artistic styles.
- No fine-tuning of the underlying diffusion model is required.
- The module functions as a direct plugin for popular pre-trained models such as SD1.5 and SDXL.
- Generation completes efficiently with a single forward inference pass.
Where Pith is reading between the lines
- The single-inference design could support generating multiple consistent images of the same person without repeated conditioning steps.
- Extending the semantic-spatial split might help preserve identity when editing poses or expressions in the output.
- Compatibility with existing models suggests the technique could be tested on related tasks such as preserving identity in short video clips.
Load-bearing premise
The assumption that strong semantic conditions from one face image plus weak spatial conditions from landmarks will maintain high face fidelity across styles without fine-tuning or additional references.
What would settle it
Generate images of a reference person in an unseen style or pose using only the single input image, then measure whether face recognition accuracy or human identification rates stay high compared to the original photo.
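The test described above can be sketched as a small scoring routine. The sketch assumes face embeddings have already been extracted by some face recognition model (not specified here) and are passed in as plain lists of floats. The function names and the 0.5 match threshold are hypothetical placeholders, not values from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_report(reference, generated, threshold=0.5):
    """Compare one reference embedding against embeddings of generated images.

    threshold is a hypothetical cut-off for counting an output as a match;
    a real evaluation would calibrate it for the chosen face model.
    """
    if not generated:
        raise ValueError("need at least one generated embedding")
    sims = [cosine_similarity(reference, g) for g in generated]
    matches = sum(1 for s in sims if s >= threshold)
    return {
        "similarities": sims,
        "mean_similarity": sum(sims) / len(sims),
        "match_rate": matches / len(sims),
    }
```

Running the same report separately for each unseen style or pose would show whether similarity to the reference photo stays high or degrades as the prompt moves away from it.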
Original abstract
There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our codes and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InstantID, a plug-and-play module for zero-shot identity-preserving image generation with diffusion models. Using a single reference facial image, it proposes IdentityNet to apply strong semantic conditioning on the face image and weak spatial conditioning on landmarks, fused with text prompts inside a frozen SD1.5 or SDXL UNet. The central claim is that this approach achieves high-fidelity personalization across diverse styles without any fine-tuning or multiple references, while remaining compatible with community pre-trained models.
Significance. If the claims are substantiated, the work would offer a practical advance over fine-tuning-heavy methods (DreamBooth, LoRA) and prior ID-embedding approaches by enabling efficient, single-image, zero-shot personalization. The open release of code and pre-trained checkpoints is a clear strength that supports reproducibility and adoption.
major comments (2)
- [Method] Method section (IdentityNet description): The design imposes strong semantic conditions from one face image plus weak landmark cues via cross-attention injection, yet provides no analysis, regularization term, or ablation demonstrating that identity features reliably dominate when text prompts introduce conflicting style semantics; this assumption is load-bearing for the zero-shot high-fidelity guarantee across styles.
- [Experiments] Experiments / Results: The abstract and manuscript assert 'exceptional performance' and 'high fidelity' but contain no quantitative metrics (e.g., face similarity scores, CLIP-based identity preservation), ablation studies on conditioning strength, or direct comparisons against baselines such as IP-Adapter or other single-reference zero-shot methods, preventing verification of the central claims.
minor comments (1)
- [Abstract] Abstract: The statement that the method 'proves highly beneficial' is imprecise; replace with concrete qualifiers tied to the reported efficiency or fidelity gains.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive evaluation of the work's significance. We address the major comments point by point below, agreeing that revisions are needed to provide more rigorous analysis and quantitative evidence.
- Referee: [Method] Method section (IdentityNet description): The design imposes strong semantic conditions from one face image plus weak landmark cues via cross-attention injection, yet provides no analysis, regularization term, or ablation demonstrating that identity features reliably dominate when text prompts introduce conflicting style semantics; this assumption is load-bearing for the zero-shot high-fidelity guarantee across styles.
  Authors: We thank the referee for highlighting this important aspect. The IdentityNet is intentionally designed to apply strong semantic conditioning on the reference face image to ensure identity dominance, with weak spatial conditioning on landmarks to allow flexibility. However, we recognize that the manuscript does not provide explicit ablations or regularization analysis for conflicting text prompts. In the revised manuscript, we will include new experiments demonstrating identity preservation under style conflicts, along with a discussion of the conditioning mechanism.
  Revision: yes
- Referee: [Experiments] Experiments / Results: The abstract and manuscript assert 'exceptional performance' and 'high fidelity' but contain no quantitative metrics (e.g., face similarity scores, CLIP-based identity preservation), ablation studies on conditioning strength, or direct comparisons against baselines such as IP-Adapter or other single-reference zero-shot methods, preventing verification of the central claims.
  Authors: We agree that quantitative metrics would better substantiate our claims of exceptional performance. The current manuscript emphasizes qualitative results to showcase the zero-shot and plug-and-play nature. We will revise the experiments section to include quantitative evaluations, such as face similarity scores using established face recognition models, CLIP-based metrics for identity and style, ablation studies on conditioning strengths, and comparisons with IP-Adapter and similar single-reference methods.
  Revision: yes
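One way the promised ablation and baseline comparison could be tabulated is sketched below. It assumes per-image identity similarity scores have already been computed elsewhere and arrive as (method, id_strength, similarity) tuples. The record layout, the function names, and the column labels are hypothetical, and no real measurements are included.

```python
from collections import defaultdict
from statistics import mean


def summarize(records):
    """Group (method, id_strength, similarity) tuples and average each group.

    Returns rows of (method, id_strength, mean_similarity, count),
    sorted by method name and then by conditioning strength.
    """
    grouped = defaultdict(list)
    for method, strength, similarity in records:
        grouped[(method, strength)].append(similarity)
    rows = [(method, strength, mean(sims), len(sims))
            for (method, strength), sims in grouped.items()]
    return sorted(rows, key=lambda row: (row[0], row[1]))


def format_table(rows):
    """Render summary rows as fixed-width plain text."""
    lines = ["method        id_strength  mean_sim  n"]
    for method, strength, mean_sim, count in rows:
        lines.append(f"{method:<13} {strength:<12.2f} {mean_sim:<9.3f} {count}")
    return "\n".join(lines)
```

Keeping the conditioning strength as an explicit column lets a single table serve both the ablation (same method, varying strength) and the baseline comparison (different methods at matched settings).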
Circularity Check
No circularity: InstantID introduces novel IdentityNet conditioning without reducing claims to fitted inputs or self-citations
The paper's central contribution is the design of a new IdentityNet module that applies strong semantic conditioning from a single face image and weak spatial cues from landmarks, fused with text prompts inside a frozen diffusion UNet. This is presented as an architectural innovation and plug-and-play component rather than a mathematical derivation, fitted prediction, or result obtained by self-citation. No equations or steps in the described method reduce to prior outputs by construction; the approach relies on explicit new conditioning mechanisms whose performance is evaluated empirically against baselines. The derivation chain is therefore self-contained and independent of the circularity patterns listed.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Pre-trained text-to-image diffusion models such as SD1.5 and SDXL can be effectively steered for identity preservation by adding an IdentityNet without modifying base model parameters.
invented entities (1)
- IdentityNet: no independent evidence
Forward citations
Cited by 19 Pith papers
- Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
  INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
- Follow the Mean: Reference-Guided Flow Matching
  Flow matching admits controllable generation by shifting the conditional endpoint mean computed from a reference set, enabling training-free guidance on frozen pretrained models.
- Follow the Mean: Reference-Guided Flow Matching
  Flow matching admits reference-guided control by shifting the conditional endpoint mean, enabling training-free steering of models like FLUX via example banks and a semi-parametric variant on DiT.
- Adaptive Subspace Projection for Generative Personalization
  A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.
- StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
  StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
- HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation
  HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.
- VACE: All-in-One Video Creation and Editing
  VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.
- InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
  InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
- L2P: Unlocking Latent Potential for Pixel Generation
  L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
- Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection
  ODP-Net structurally disentangles universal forgery traces from generator fingerprints and semantics via orthogonal decomposition and purification, delivering state-of-the-art generalization to unseen AI image generat...
- DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
  DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.
- PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios
  PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...
- Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
  Scone unifies subject understanding and generation in a two-stage trained model to improve both composition and distinction in multi-subject image generation, outperforming prior open-source models on new benchmarks.
- How Noise Benefits AI-generated Image Detection
  PiN-CLIP jointly trains a noise generator and detector under a variational positive-incentive principle to inject feature-space noise that suppresses shortcut directions and improves out-of-distribution accuracy by 5....
- Adversarial Concept Distillation for One-Step Diffusion Personalization
  OPAD enables reliable high-quality personalization of one-step diffusion models via multi-step teacher distillation combined with adversarial alignment losses.
- RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
  RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrativ...
- When Few Steps Are Enough: Training-Free Acceleration of Identity-Preserved Generation
  Frozen identity adapter from FLUX dev works on distilled schnell model, enabling 5.9x faster generation with better identity preservation in few steps.
- PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
  A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.
- AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
  Adaptive Head Synthesis (AHS) employs head-reenacted synthetic data augmentation to enable robust head swapping on full upper-body images without paired training data.