arxiv: 2510.20512 · v2 · submitted 2025-10-23 · 💻 cs.CV

Adversarial Concept Distillation for One-Step Diffusion Personalization

Yixiong Yang , Tao Wu , Senmao Li , Shiqi Yang , Yaxing Wang , Joost Van De Weijer , Kai Wang This is my paper

Pith reviewed 2026-05-18 04:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords one-step diffusionpersonalizationadversarial distillationteacher-student learningimage generationconcept customizationfast synthesis

0 comments

The pith

Adversarial distillation from multi-step teachers enables reliable personalization of one-step diffusion models for the first time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing methods for customizing text-to-image diffusion models produce severe failures when applied to accelerated one-step versions. The paper introduces a distillation framework in which a one-step student model is trained jointly with a multi-step teacher. Alignment losses keep the student consistent with the teacher while adversarial losses push the student's outputs toward the distribution of real images. This dual supervision produces usable personalized results at single-step speed. The same setup creates a feedback loop in which the student's efficient outputs help refine the teacher model.

Core claim

The central claim is that one-step diffusion models can be personalized reliably by combining teacher-student consistency losses with adversarial losses that align generated images to real distributions. A multi-step teacher guides the student on content and structure while the adversarial component corrects for distribution shift introduced by the single-step approximation. The resulting student preserves generation speed and, through a subsequent collaborative stage, supplies useful signals that improve the teacher.

What carries the argument

The OPAD joint-training loop in which a one-step student receives both consistency losses from a multi-step teacher and adversarial losses against real images, plus a mutual-improvement stage that feeds student outputs back to the teacher.

If this is right

Personalized images become available at single-step inference speed without the quality collapse seen in earlier attempts.
The efficiency advantage of one-step models is retained while personalization quality approaches that of slower multi-step baselines.
A closed feedback loop between student and teacher yields iterative gains in both models.
Adversarial alignment proves sufficient to compensate for the information loss that occurs when reducing denoising steps from many to one.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same adversarial-distillation pattern may transfer to other single-step generative architectures such as consistency models or distilled flows.
Real-time on-device personalization applications become practical once single-step inference is paired with this training recipe.
The collaborative stage suggests a general route for using fast student models to accelerate improvement of their slower teachers.

Load-bearing premise

Adversarial losses can be balanced so they close the distribution gap between one-step outputs and real images without creating new artifacts or training instability.

What would settle it

A side-by-side evaluation on standard personalization benchmarks in which OPAD produces the same severe artifacts or mode-collapse cases reported for prior one-step methods would falsify the reliability claim.

Figures

Figures reproduced from arXiv: 2510.20512 by Joost Van De Weijer, Kai Wang, Senmao Li, Shiqi Yang, Tao Wu, Yaxing Wang, Yixiong Yang.

**Figure 2.** Figure 2: Overview of EchoDistill. The student and teacher jointly learn the new concept with a shared text encoder. The teacher learns from real images x r 0 (green line), and the text encoder is updated accordingly. The student is optimized with two objectives (gold line): an adversarial loss to match real data distribution and alignment losses to match the denoised outputs of the teacher. The discriminators are t… view at source ↗

**Figure 3.** Figure 3: Our method EchoDistill (last column) compared with existing methods applied to the 1-SDP setup with SDTurbo [73] as the one-step diffusion backbone. One representive concept image is shown on the left-most column. 3.3.2 Echoing Stage: Student improves the Teacher We interpret the one-step student model as a GAN-like generator, and hypothesize that it can benefit from aligning with the few-shot target data … view at source ↗

**Figure 4.** Figure 4: (a) Abating the Infer-NFEs; (b) Ablating the one-step diffusion backbone; (c) Ablating the [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: The qualitative results of the 1-shot performance. [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗

**Figure 6.** Figure 6: Our method EchoDistill (last column) compared with existing methods applied to the 1-SDP setup with SDTurbo [73] as the one-step diffusion backbone. One representive concept image is shown on the left-most column. (Part 1) 25 [PITH_FULL_IMAGE:figures/full_fig_p025_6.png] view at source ↗

**Figure 7.** Figure 7: Our method EchoDistill (last column) compared with existing methods applied to the 1-SDP setup with SDTurbo [73] as the one-step diffusion backbone. One representive concept image is shown on the left-most column. (Part 2) 26 [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗

**Figure 8.** Figure 8: Our method EchoDistill (last column) compared with existing methods applied to the 1-SDP setup with SDTurbo [73] as the one-step diffusion backbone. One representive concept image is shown on the left-most column. (Part 3) 27 [PITH_FULL_IMAGE:figures/full_fig_p027_8.png] view at source ↗

**Figure 9.** Figure 9: Our method EchoDistill (last column) compared with existing methods applied to the 1-SDP setup with SDTurbo [73] as the one-step diffusion backbone. One representive concept image is shown on the left-most column. (Part 4) 28 [PITH_FULL_IMAGE:figures/full_fig_p028_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative results of the extended ablation study. [PITH_FULL_IMAGE:figures/full_fig_p029_10.png] view at source ↗

**Figure 11.** Figure 11: Qualitative results of EchoDistill on the CustomConcept101 dataset. Our method demonstrates strong generalization across a variety of concepts and prompt styles. (Part 1) 30 [PITH_FULL_IMAGE:figures/full_fig_p030_11.png] view at source ↗

**Figure 12.** Figure 12: Qualitative results of EchoDistill on the CustomConcept101 dataset. Our method demonstrates strong generalization across a variety of concepts and prompt styles. (Part 2). 31 [PITH_FULL_IMAGE:figures/full_fig_p031_12.png] view at source ↗

read the original abstract

Recent progress in accelerating text-to-image diffusion models enables high-fidelity synthesis within a single denoising step. However, customizing the fast one-step models remains challenging, as existing methods consistently fail to produce acceptable results, underscoring the need for new methodologies to personalize one-step models. Therefore, we propose One-step Personalized Adversarial Distillation (OPAD), a framework that combines teacher-student distillation with adversarial supervision. A multi-step diffusion model serves as the teacher, while a one-step student model is jointly trained with it. The student learns from alignment losses that preserve consistency with the teacher and from adversarial losses that align its output with real image distributions. Beyond one-step personalization, we further observe that the student's efficient generation and adversarially enriched representations provide valuable feedback to improve the teacher model, forming a collaborative learning stage. Extensive experiments demonstrate that OPAD is the first approach to deliver reliable, high-quality personalization for one-step diffusion models; in contrast, prior methods largely fail and produce severe failure cases, while OPAD preserves single-step efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OPAD adds adversarial supervision to teacher-student distillation for one-step diffusion personalization and claims a feedback loop back to the teacher, but the stability of that adversarial signal in the single-step regime is the part that still needs tighter evidence.

read the letter

The main point is that this paper targets a practical bottleneck: standard personalization techniques break down when you force diffusion models down to one denoising step, and OPAD tries to fix it by training a one-step student against a multi-step teacher using both consistency losses and adversarial losses that push outputs toward real-image distributions. They also add a collaborative stage where the fast student gives feedback that supposedly improves the teacher. That combination and the feedback loop are the concrete novelties relative to prior distillation work. The paper does a reasonable job laying out why existing methods produce severe artifacts in the one-step case and why keeping single-step inference matters for deployment. If the experiments include direct side-by-side visuals and some quantitative metrics on identity preservation and image quality, that part is useful for people who actually need fast customization. The soft spot is exactly the one the stress-test flags. One-step models lack the iterative correction that usually keeps adversarial training from drifting into artifacts or mode collapse, and the joint training setup does not automatically guarantee that the adversarial term stays well-behaved. The balancing weights between alignment and adversarial losses are free parameters, so results could be sensitive to tuning. Without clear ablations that isolate the adversarial component and show it does not introduce new failure modes, the central claim that OPAD is the first reliable method rests on thinner ground than the abstract suggests. This is for researchers working on efficient generative models and on-device personalization. A reader who cares about latency-sensitive creative tools would get practical value from the method description and the efficiency claims. The work shows clear thinking about the problem and honest engagement with the limitations of prior approaches, so it deserves a serious referee even if the experiments need strengthening on the adversarial stability question. Recommendation: send it to review with a request for more targeted ablations on the adversarial loss.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes One-step Personalized Adversarial Distillation (OPAD), a framework that combines teacher-student distillation with adversarial supervision for personalizing one-step text-to-image diffusion models. A multi-step teacher guides a one-step student via alignment (consistency) losses and adversarial losses that align student outputs with real-image distributions; a subsequent collaborative stage lets the student provide feedback to improve the teacher. The central claim is that OPAD is the first method to achieve reliable high-quality personalization in the one-step regime, where prior approaches produce severe failures, while preserving single-step efficiency.

Significance. If the experimental claims hold, the work addresses a practically important gap: one-step diffusion models offer substantial inference speedups but have resisted personalization. The collaborative feedback loop between student and teacher is a distinctive element that could generalize beyond this setting. The approach builds on established distillation and adversarial techniques but applies them in a new one-step context with an efficiency-preserving feedback mechanism.

major comments (2)

[Abstract] Abstract and experimental claims: the assertion that 'prior methods largely fail and produce severe failure cases' while OPAD succeeds is load-bearing for the central contribution, yet the provided description supplies no quantitative metrics (FID, CLIP similarity, user-study scores, or ablation tables) or failure-case analysis to support the magnitude of improvement or to isolate the adversarial component's contribution.
[Method (adversarial and collaborative stages)] Adversarial alignment mechanism: the claim that adversarial losses reliably align the one-step student's output distribution to real images without artifacts or instability rests on the joint training with consistency losses, but the one-step setting lacks the iterative refinement that normally stabilizes such training; concrete details on loss weighting, training dynamics, or safeguards against mode collapse (e.g., in the loss formulation or optimization procedure) are required to substantiate the weakest assumption.

minor comments (2)

[Training details] Clarify how the balancing weights between alignment and adversarial losses are chosen or scheduled, as they appear among the free parameters.
[Experiments] Add explicit ablation isolating the collaborative feedback stage to demonstrate its contribution to teacher improvement without compromising student efficiency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, indicating where revisions will be made to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract and experimental claims: the assertion that 'prior methods largely fail and produce severe failure cases' while OPAD succeeds is load-bearing for the central contribution, yet the provided description supplies no quantitative metrics (FID, CLIP similarity, user-study scores, or ablation tables) or failure-case analysis to support the magnitude of improvement or to isolate the adversarial component's contribution.

Authors: We thank the referee for highlighting this. The full manuscript includes comprehensive quantitative evaluations with FID, CLIP similarity, user-study scores, ablation tables, and failure-case visualizations that support the claims and isolate the adversarial component's contribution. To strengthen the abstract, we will revise it to incorporate key metrics and explicitly reference the relevant experimental sections and figures. revision: yes
Referee: [Method (adversarial and collaborative stages)] Adversarial alignment mechanism: the claim that adversarial losses reliably align the one-step student's output distribution to real images without artifacts or instability rests on the joint training with consistency losses, but the one-step setting lacks the iterative refinement that normally stabilizes such training; concrete details on loss weighting, training dynamics, or safeguards against mode collapse (e.g., in the loss formulation or optimization procedure) are required to substantiate the weakest assumption.

Authors: We agree that additional concrete details will improve clarity. In the revised manuscript we will expand the method section to specify the exact loss weighting coefficients, describe the observed training dynamics and convergence behavior, and detail safeguards including the stabilizing role of consistency losses, spectral normalization on the discriminator, gradient penalty terms, and alternating optimization to mitigate mode collapse. These elements are already implemented in our training procedure and contribute to the observed stability. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes OPAD as a framework that combines established teacher-student distillation with adversarial supervision for one-step model personalization. The abstract describes alignment losses preserving consistency with a multi-step teacher and adversarial losses aligning student outputs to real image distributions, plus a collaborative feedback stage. No equations, derivations, or self-referential definitions are shown that would reduce the claimed results or performance to inputs by construction. The central claims rest on empirical experiments rather than tautological reductions or load-bearing self-citations, rendering the method self-contained against external benchmarks in distillation and adversarial training.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework rests on standard machine-learning assumptions about distillation and adversarial training rather than new physical or mathematical entities.

free parameters (1)

balancing weights between alignment and adversarial losses
Typical hyper-parameters in joint distillation-adversarial training that must be chosen to stabilize the student-teacher loop.

axioms (1)

domain assumption A multi-step diffusion model can serve as an effective teacher whose outputs provide useful supervision for a one-step student.
Invoked when the abstract states the student learns from alignment losses with the teacher.

pith-pipeline@v0.9.0 · 5726 in / 1166 out tokens · 45959 ms · 2026-05-18T04:46:48.488413+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · 6 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Stellar: systematic evaluation of human-centric personalized text-to-image methods.arXiv preprint arXiv:2312.06116, 2023

Panos Achlioptas, Alexandros Benetatos, Iordanis Fostiropoulos, and Dimitris Skourtis. Stellar: systematic evaluation of human-centric personalized text-to-image methods.arXiv preprint arXiv:2312.06116, 2023

work page arXiv 2023
[3]

An image is worth multiple words: Multi-attribute inversion for constrained text-to-image synthesis

Aishwarya Agarwal, Srikrishna Karanam, Tripti Shukla, and Balaji Vasan Srinivasan. An image is worth multiple words: Multi-attribute inversion for constrained text-to-image synthesis. International Conference on Machine Learning, 2024

work page 2024
[4]

A neural space-time representation for text-to-image personalization.ACM Transactions on Graphics (TOG), 42(6):1–10, 2023

Yuval Alaluf, Elad Richardson, Gal Metzer, and Daniel Cohen-Or. A neural space-time representation for text-to-image personalization.ACM Transactions on Graphics (TOG), 42(6):1–10, 2023

work page 2023
[5]

Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, and Amit H. Bermano. Domain-agnostic tuning-encoder for fast personalization of text-to-image models. InSIGGRAPH Asia 2023 Conference Papers, pages 1–10, 2023

work page 2023
[6]

Kandinsky 3.0 technical report.arXiv preprint arXiv:2312.03511, 2023

Vladimir Arkhipkin, Andrei Filatov, Viacheslav Vasilev, Anastasia Maltseva, Said Azizov, Igor Pavlov, Julia Agafonova, Andrey Kuznetsov, and Denis Dimitrov. Kandinsky 3.0 technical report.arXiv preprint arXiv:2312.03511, 2023. 10

work page arXiv 2023
[7]

Break-a- scene: Extracting multiple concepts from a single image.SIGGRAPH Asia 2023, 2023

Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. Break-a- scene: Extracting multiple concepts from a single image.SIGGRAPH Asia 2023, 2023

work page 2023
[8]

Colorpeel: Color prompt learning with diffusion models via color and shape disentanglement

Muhammad Atif Butt, Kai Wang, Javier Vazquez-Corral, and Joost van de Weijer. Colorpeel: Color prompt learning with diffusion models via color and shape disentanglement. InEuropean Conference on Computer Vision, 2024

work page 2024
[9]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

work page 2021
[10]

Efficient geometry- aware 3d generative adversarial networks

Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry- aware 3d generative adversarial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16123–16133, 2022

work page 2022
[11]

Xverse: Consistent multi-subject control of identity and semantic attributes via dit modulation

Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, and Xinglong Wu. Xverse: Consistent multi-subject control of identity and semantic attributes via dit modulation. arXiv preprint arXiv:2506.21416, 2025

work page arXiv 2025
[12]

Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation.International Conference on Learning Representations, 2024

Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation.International Conference on Learning Representations, 2024

work page 2024
[13]

Subject-driven text-to-image generation via apprenticeship learning

Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Rui, Xuhui Jia, Ming-Wei Chang, and William W Cohen. Subject-driven text-to-image generation via apprenticeship learning. Advances in Neural Information Processing Systems, 2023

work page 2023
[14]

Re-imagen: Retrieval- augmented text-to-image generator

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval- augmented text-to-image generator. InThe Eleventh International Conference on Learning Representations, 2022

work page 2022
[15]

Fine-tuning visual autoregressive models for subject-driven generation.Proceedings of the International Conference on Computer Vision, 2025

Jiwoo Chung, Sangeek Hyun, Hyunjun Kim, Eunseo Koh, MinKyu Lee, and Jae-Pil Heo. Fine-tuning visual autoregressive models for subject-driven generation.Proceedings of the International Conference on Computer Vision, 2025

work page 2025
[16]

Idadapter: Learning mixed features for tuning-free personalization of text-to-image models

Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, and Ziyong Feng. Idadapter: Learning mixed features for tuning-free personalization of text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 950–959, 2024

work page 2024
[17]

Swiftbrush v2: Make your one-step diffusion model better than its teacher.European Conference on Computer Vision, 2024

Trung Dao, Thuan Hoang Nguyen, Thanh Le, Duc Vu, Khoi Nguyen, Cuong Pham, and Anh Tran. Swiftbrush v2: Make your one-step diffusion model better than its teacher.European Conference on Computer Vision, 2024

work page 2024
[18]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Freecustom: Tuning-free customized image generation for multi-concept composition

Ganggui Ding, Canyu Zhao, Wen Wang, Zhen Yang, Zide Liu, Hao Chen, and Chunhua Shen. Freecustom: Tuning-free customized image generation for multi-concept composition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9089–9098, 2024

work page 2024
[20]

Dreamartist: Towards controllable one-shot text-to- image generation via contrastive prompt-tuning.arXiv preprint arXiv:2211.11337, 2022

Ziyi Dong, Pengxu Wei, and Liang Lin. Dreamartist: Towards controllable one-shot text-to- image generation via contrastive prompt-tuning.arXiv preprint arXiv:2211.11337, 2022

work page arXiv 2022
[21]

An image is worth one word: Personalizing text-to-image generation using textual inversion.International Conference on Learning Representations, 2023

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion.International Conference on Learning Representations, 2023

work page 2023
[22]

Designing an encoder for fast personalization of text-to-image models.arXiv preprint arXiv:2302.12228, 2023

Rinon Gal, Moab Arar, Yuval Atzmon, Amit H Bermano, Gal Chechik, and Daniel Cohen- Or. Designing an encoder for fast personalization of text-to-image models.arXiv preprint arXiv:2302.12228, 2023. 11

work page arXiv 2023
[23]

Lcm-lookahead for encoder-based text-to-image personalization

Rinon Gal, Or Lichter, Elad Richardson, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Lcm-lookahead for encoder-based text-to-image personalization. In European Conference on Computer Vision, pages 322–340. Springer, 2024

work page 2024
[24]

Tokenverse: Versatile multi-concept personalization in token modulation space.Proceedings of the International Conference on Computer Vision, 2025

Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, and Tali Dekel. Tokenverse: Versatile multi-concept personalization in token modulation space.Proceedings of the International Conference on Computer Vision, 2025

work page 2025
[25]

Mix-of-show: Decentralized low-rank adap- tation for multi-concept customization of diffusion models.Advances in Neural Information Processing Systems, 36, 2024

Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low-rank adap- tation for multi-concept customization of diffusion models.Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[26]

Pulid: Pure and lightning id customization via contrastive alignment.Advances in neural information processing systems, 37:36777–36804, 2024

Zinan Guo, Yanze Wu, Chen Zhuowei, Peng Zhang, Qian He, et al. Pulid: Pure and lightning id customization via contrastive alignment.Advances in neural information processing systems, 37:36777–36804, 2024

work page 2024
[27]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[28]

Svdiff: Compact parameter space for diffusion fine-tuning.Proceedings of the International Conference on Computer Vision, 2023

Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas, and Feng Yang. Svdiff: Compact parameter space for diffusion fine-tuning.Proceedings of the International Conference on Computer Vision, 2023

work page 2023
[29]

Multiscale sliced wasserstein distances as perceptual color difference measures

Jiaqi He, Zhihua Wang, Leon Wang, Tsein-I Liu, Yuming Fang, Qilin Sun, and Kede Ma. Multiscale sliced wasserstein distances as perceptual color difference measures. InEuropean Conference on Computer Vision, pages 425–442. Springer, 2024

work page 2024
[30]

Prompt-to-prompt image editing with cross attention control.International Conference on Learning Representations, 2023

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control.International Conference on Learning Representations, 2023

work page 2023
[31]

Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020

work page 2020
[32]

Classdiffusion: More aligned personalization tuning with explicit class guidance

Jiannan Huang, Jun Hao Liew, Hanshu Yan, Yuyang Yin, Yao Zhao, Humphrey Shi, and Yunchao Wei. Classdiffusion: More aligned personalization tuning with explicit class guidance. International Conference on Learning Representations, 2025

work page 2025
[33]

Huang, X

Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, et al. Consistentid: Portrait generation with multi- modal fine-grained identity preserving.arXiv preprint arXiv:2404.16771, 2024

work page arXiv 2024
[34]

Resolving multi-condition confusion for finetuning-free personalized image generation.Proceedings of the Conference on Artificial Intelligence, 2025

Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng Yu, and Jie Song. Resolving multi-condition confusion for finetuning-free personalized image generation.Proceedings of the Conference on Artificial Intelligence, 2025

work page 2025
[35]

Taming encoder for zero fine-tuning image customization with text-to-image diffusion models.arXiv preprint arXiv:2304.02642, 2023

Xuhui Jia, Yang Zhao, Kelvin CK Chan, Yandong Li, Han Zhang, Boqing Gong, Tingbo Hou, Huisheng Wang, and Yu-Chuan Su. Taming encoder for zero fine-tuning image customization with text-to-image diffusion models.arXiv preprint arXiv:2304.02642, 2023

work page arXiv 2023
[36]

Infiniteyou: Flexible photo recrafting while preserving your identity.Proceedings of the International Conference on Computer Vision, 2025

Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Hao Kang, and Xin Lu. Infiniteyou: Flexible photo recrafting while preserving your identity.Proceedings of the International Conference on Computer Vision, 2025

work page 2025
[37]

Omg: Occlusion-friendly personalized multi-concept generation in diffusion models

Zhe Kong, Yong Zhang, Tianyu Yang, Tao Wang, Kaihao Zhang, Bizhu Wu, Guanying Chen, Wei Liu, and Wenhan Luo. Omg: Occlusion-friendly personalized multi-concept generation in diffusion models. InEuropean Conference on Computer Vision, pages 253–270. Springer, 2024

work page 2024
[38]

Generating multi- image synthetic data for text-to-image customization.arXiv preprint arXiv:2502.01720, 2025

Nupur Kumari, Xi Yin, Jun-Yan Zhu, Ishan Misra, and Samaneh Azadi. Generating multi- image synthetic data for text-to-image customization.arXiv preprint arXiv:2502.01720, 2025. 12

work page arXiv 2025
[39]

Multi- concept customization of text-to-image diffusion.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi- concept customization of text-to-image diffusion.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023

work page 2023
[40]

Ensembling off-the-shelf models for gan training

Nupur Kumari, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Ensembling off-the-shelf models for gan training. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10651–10662, 2022

work page 2022
[41]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

work page 2024
[42]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.International Conference on Machine Learning, 2023

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.International Conference on Machine Learning, 2023

work page 2023
[43]

Photomaker: Customizing realistic human photos via stacked id embedding.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[44]

A comprehensive survey on visual concept mining in text-to-image diffusion models.arXiv preprint arXiv:2503.13576, 2025

Ziqiang Li, Jun Li, Lizhi Xiong, Zhangjie Fu, and Zechao Li. A comprehensive survey on visual concept mining in text-to-image diffusion models.arXiv preprint arXiv:2503.13576, 2025

work page arXiv 2025
[45]

Distilled decoding 1: One-step sampling of image auto-regressive models with flow matching.International Conference on Learning Representations, 2025

Enshu Liu, Xuefei Ning, Yu Wang, and Zinan Lin. Distilled decoding 1: One-step sampling of image auto-regressive models with flow matching.International Conference on Learning Representations, 2025

work page 2025
[46]

Cones: Concept neurons in diffusion models for customized generation

Zhiheng Liu, Ruili Feng, Kai Zhu, Yifei Zhang, Kecheng Zheng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Cones: Concept neurons in diffusion models for customized generation. International Conference on Machine Learning, 2023

work page 2023
[47]

Customizable image synthesis with multiple subjects

Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, and Yang Cao. Customizable image synthesis with multiple subjects. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[48]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787, 2022

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787, 2022

work page 2022
[49]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Lcm-lora: A universal stable-diffusion acceleration module

Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, and Hang Zhao. Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556, 2023

work page arXiv 2023
[51]

Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning.Proceedings of the ACM SIGGRAPH Conference on Computer Graphics, 2024

Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning.Proceedings of the ACM SIGGRAPH Conference on Computer Graphics, 2024

work page 2024
[52]

Overview of intelligent video coding: from model-based to learning-based approaches

Siwei Ma, Junlong Gao, Ruofan Wang, Jianhui Chang, Qi Mao, Zhimeng Huang, and Chuan- min Jia. Overview of intelligent video coding: from model-based to learning-based approaches. Visual Intelligence, 1(1):15, 2023

work page 2023
[53]

Unified multi-modal latent diffusion for joint subject and text conditional image generation.arXiv preprint arXiv:2303.09319, 2023

Yiyang Ma, Huan Yang, Wenjing Wang, Jianlong Fu, and Jiaying Liu. Unified multi-modal latent diffusion for joint subject and text conditional image generation.arXiv preprint arXiv:2303.09319, 2023

work page arXiv 2023
[54]

T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models.Proceedings of the Conference on Artificial Intelligence, 2024

Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models.Proceedings of the Conference on Artificial Intelligence, 2024

work page 2024
[55]

Dreamo: A unified framework for image customization.SIGGRAPH Asia, 2025

Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. Dreamo: A unified framework for image customization.SIGGRAPH Asia, 2025. 13

work page 2025
[56]

Swiftbrush: One-step text-to-image diffusion model with variational score distillation.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

Thuan Hoang Nguyen and Anh Tran. Swiftbrush: One-step text-to-image diffusion model with variational score distillation.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[57]

Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Laba...

work page 2023
[58]

Kosmos- g: Generating images in context with multimodal large language models.International Conference on Learning Representations, 2024

Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. Kosmos- g: Generating images in context with multimodal large language models.International Conference on Learning Representations, 2024

work page 2024
[59]

Attndreambooth: Towards text-aligned personalized text-to-image generation.Advances in Neural Information Processing Systems, 37:39869–39900, 2024

Lianyu Pang, Jian Yin, Baoquan Zhao, Feize Wu, Fu Lee Wang, Qing Li, and Xudong Mao. Attndreambooth: Towards text-aligned personalized text-to-image generation.Advances in Neural Information Processing Systems, 37:39869–39900, 2024

work page 2024
[60]

Textboost: Towards one-shot personaliza- tion of text-to-image models via fine-tuning text encoder.arXiv preprint arXiv:2409.08248, 2024

NaHyeon Park, Kunhee Kim, and Hyunjung Shim. Textboost: Towards one-shot personaliza- tion of text-to-image models via fine-tuning text encoder.arXiv preprint arXiv:2409.08248, 2024

work page arXiv 2024
[61]

lambda-eclipse: Multi-concept personalized text-to-image diffusion models by leveraging clip latent space.arXiv preprint arXiv:2402.05195, 2024

Maitreya Patel, Sangmin Jung, Chitta Baral, and Yezhou Yang. lambda-eclipse: Multi-concept personalized text-to-image diffusion models by leveraging clip latent space.arXiv preprint arXiv:2402.05195, 2024

work page arXiv 2024
[62]

Orthogonal adaptation for modular customization of diffusion models

Ryan Po, Guandao Yang, Kfir Aberman, and Gordon Wetzstein. Orthogonal adaptation for modular customization of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7964–7973, 2024

work page 2024
[63]

Barron, and Ben Mildenhall

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D Diffusion.International Conference on Learning Representations, 2023

work page 2023
[64]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

work page 2021
[65]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 06 2022

work page 2022
[66]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015

work page 2015
[67]

Ipadapter-instruct: Resolving ambiguity in image-based conditioning using instruct prompts, 2024

Ciara Rowles, Shimon Vainer, Dante De Nigris, Slava Elizarov, Konstantin Kutsy, and Simon Donné. Ipadapter-instruct: Resolving ambiguity in image-based conditioning using instruct prompts, 2024

work page 2024
[68]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aber- man. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023

work page 2023
[69]

Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models.arXiv preprint arXiv:2307.06949, 2023

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wad- hwa, Michael Rubinstein, and Kfir Aberman. Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models.arXiv preprint arXiv:2307.06949, 2023

work page arXiv 2023
[70]

Align your flow: Scaling continuous-time flow map distillation

Amirmojtaba Sabour, Sanja Fidler, and Karsten Kreis. Align your flow: Scaling continuous- time flow map distillation.arXiv preprint arXiv:2506.14603, 2025. 14

work page arXiv 2025
[71]

Photorealistic text-to-image diffusion models with deep language understanding.Advances in Neural Information Processing Systems, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in Neural Information Processing Systems, 2022

work page 2022
[72]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. International Conference on Learning Representations, 2022

work page 2022
[73]

Adversarial diffusion distillation.European Conference on Computer Vision, 2024

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation.European Conference on Computer Vision, 2024

work page 2024
[74]

Ziplora: Any subject in any style by effectively merging loras.arXiv preprint arXiv:2311.13600, 2023

Viraj Shah, Nataniel Ruiz, Forrester Cole, Erika Lu, Svetlana Lazebnik, Yuanzhen Li, and Varun Jampani. Ziplora: Any subject in any style by effectively merging loras.arXiv preprint arXiv:2311.13600, 2023

work page arXiv 2023
[75]

Instantbooth: Personalized text-to-image generation without test-time finetuning.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to-image generation without test-time finetuning.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[76]

Loraclr: Contrastive adaptation for customization of diffusion models.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

Enis Simsar, Thomas Hofmann, Federico Tombari, and Pinar Yanardag. Loraclr: Contrastive adaptation for customization of diffusion models.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[77]

Continual diffusion: Continual customization of text-to-image diffusion with c-lora.arXiv preprint arXiv:2304.06027, 2023

James Seale Smith, Yen-Chang Hsu, Lingyu Zhang, Ting Hua, Zsolt Kira, Yilin Shen, and Hongxia Jin. Continual diffusion: Continual customization of text-to-image diffusion with c-lora.arXiv preprint arXiv:2304.06027, 2023

work page arXiv 2023
[78]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021

work page 2021
[79]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, pages 32211–32252. PMLR, 2023

work page 2023
[80]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

Showing first 80 references.