arxiv: 2605.25378 · v2 · pith:YSKTPY4Vnew · submitted 2026-05-25 · 💻 cs.CV · cs.AI

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

Pith reviewed 2026-06-29 22:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords LoRA · diffusion models · image editing · multi-teacher distillation · concept isolation · few-step generation · model merging · customized effects

The pith

A single LoRA can absorb concepts from up to 50 separate effect adapters plus few-step generation without interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that the visual effects from dozens of specialized LoRAs can be merged into one adapter through multi-teacher distillation. A sympathetic reader would care because this would eliminate the storage burden and switching overhead that currently limit widespread use of customized diffusion models for image editing. The work introduces routing between data sources, prompt-space isolation, and staged distillation losses to keep the effects distinct while closing the gap to the original teachers.

Core claim

CollectionLoRA is a multi-teacher on-policy distillation framework capable of distilling the concepts of up to 50 different effect LoRAs along with few-step generation capabilities into a single LoRA. This fundamentally resolves the feature interference issue and significantly reduces deployment costs. Specifically, the method introduces a Probabilistic Dual-Stream Routing mechanism that enables the model to randomly switch between data sources during training, an Asymmetric Orthogonal Prompting strategy to achieve concept isolation within the prompt space, and a Coarse-to-Fine Distillation Objective to mitigate the distribution gap between the teacher and student models.
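
The routing idea in the claim above, random switching between data sources during training, can be pictured with a minimal sketch. Everything concrete here is an assumption made for illustration: the two stream arguments, the switching probability `p_first`, and the function name `route_batches` are hypothetical and are not taken from the paper or its code.

```python
import random
from typing import Iterator, Sequence, Tuple


def route_batches(
    stream_a: Sequence[str],
    stream_b: Sequence[str],
    p_first: float,
    steps: int,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, sample) pairs, choosing a source at random per step.

    Hypothetical illustration: with probability p_first the sample is drawn
    from stream_a, otherwise from stream_b. Within a stream, samples are
    drawn uniformly at random.
    """
    if not 0.0 <= p_first <= 1.0:
        raise ValueError("p_first must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so a run can be reproduced
    for _ in range(steps):
        if rng.random() < p_first:
            yield "a", rng.choice(stream_a)
        else:
            yield "b", rng.choice(stream_b)
```

A training loop would consume one pair per step, for example `for source, sample in route_batches(a, b, 0.5, 1000): ...`. How the paper actually defines its two streams and sets the probability is not stated in the material reviewed here.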

What carries the argument

The argument rests on multi-teacher on-policy distillation with three components: Probabilistic Dual-Stream Routing to switch between data sources, Asymmetric Orthogonal Prompting for concept isolation, and a Coarse-to-Fine Distillation Objective to close the teacher-student gap.
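
To make "concept isolation" in prompt space concrete, one generic check is how far a set of per-concept prompt vectors is from being mutually orthogonal. The sketch below is only that generic check. It assumes each concept is represented by a single embedding vector, which is a simplification introduced here, and the helper names are hypothetical. It does not reproduce the paper's prompting strategy.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def max_pairwise_abs_cosine(vectors: Sequence[Sequence[float]]) -> float:
    """Largest |cosine| over all pairs; 0.0 means the set is fully orthogonal."""
    if len(vectors) < 2:
        raise ValueError("need at least two vectors to compare")
    return max(abs(cosine(u, v)) for u, v in combinations(vectors, 2))
```

A value near zero would indicate that the concept vectors do not overlap in direction; a value near one would flag a pair of concepts that could be confused.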

If this is right

  • One adapter replaces many separate effect LoRAs, lowering storage and loading overhead during deployment.
  • Concept fidelity stays comparable to or better than the independent teacher models.
  • Few-step generation is retained inside the same adapter without cascading separate acceleration modules.
  • Random switching between data sources improves generalization on prompts outside the training set.
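
The storage point in the first bullet can be made concrete with back-of-the-envelope arithmetic. A rank-r LoRA on a weight matrix of shape d_out × d_in adds r · (d_in + d_out) parameters, which follows from the two low-rank factors. The layer width, rank, layer count, and 2-byte precision used in the usage note are made-up values for illustration, not figures from the paper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int, num_layers: int) -> int:
    """Parameters added by rank-`rank` LoRA factors on `num_layers` matrices.

    Each adapted matrix gets two factors of shapes (d_out, rank) and
    (rank, d_in), so rank * (d_in + d_out) parameters per matrix.
    """
    return num_layers * rank * (d_in + d_out)


def storage_mib(param_count: int, bytes_per_param: int = 2) -> float:
    """Storage in MiB, assuming a fixed number of bytes per parameter."""
    return param_count * bytes_per_param / (1024 ** 2)


# Hypothetical configuration: 100 adapted square matrices of width 3072, rank 16.
one_adapter = lora_param_count(d_in=3072, d_out=3072, rank=16, num_layers=100)
fifty_adapters = 50 * one_adapter
```

With these made-up numbers, `storage_mib(one_adapter)` is 18.75 and `storage_mib(fifty_adapters)` is 937.5. A consolidated adapter may need a higher rank than any single teacher, so the real saving could be smaller than 50×.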

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation pattern could be tried on adapters for tasks other than visual effects, such as style or subject control.
  • Production systems that switch models frequently might see lower latency once multiple capabilities live in one file.
  • If the isolation mechanisms hold, the approach could be tested with more than 50 effects to check scaling limits.

Load-bearing premise

The three introduced components together suffice to isolate concepts and close the teacher-student distribution gap without requiring post-hoc data filtering or hyperparameter choices that affect the reported fidelity gains.

What would settle it

A side-by-side test of the single LoRA against the original separate teachers on prompts that request two or more effects at once, checking whether concept bleeding or quality drop appears in the outputs.
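
The side-by-side test described above could be tallied along the following lines. This is a sketch under stated assumptions: each prompt has already been judged, by a person or an automated judge, for whether the output shows concept bleeding, and the `Judgment` record and `bleed_rates` function are hypothetical names introduced here. It is not the paper's evaluation protocol.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Judgment:
    """One multi-effect prompt with a bleeding verdict for each system."""
    prompt: str
    student_bleeds: bool   # verdict for the single consolidated LoRA
    teacher_bleeds: bool   # verdict for the separate teacher LoRAs


def bleed_rates(judgments: Iterable[Judgment]) -> Tuple[float, float]:
    """Return (student_rate, teacher_rate): fraction of prompts with bleeding."""
    items = list(judgments)
    if not items:
        raise ValueError("no judgments supplied")
    n = len(items)
    student_rate = sum(j.student_bleeds for j in items) / n
    teacher_rate = sum(j.teacher_bleeds for j in items) / n
    return student_rate, teacher_rate
```

A student rate clearly above the teacher rate on multi-effect prompts would count against the isolation claim; comparable or lower rates would support it.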

Figures

Figures reproduced from arXiv: 2605.25378 by Fangtai Wu, Hailong Guo, Jiaming Liu, Jiayi Song, Mushui Liu, Ruihua Huang, Shijie Huang, Yubo Huang, Yunlong Yu, Zhao Wang.

Figure 1: We propose CollectionLoRA, a multi-teacher distillation framework capable of consolidating diverse effects and few-step inference capabilities into a single LoRA.
Figure 2: Comparison between conventional multi-LoRA pipelines and the proposed CollectionLoRA. Conventional: Training and deploying task-specific LoRA weights for each concept, which are sequentially composed with an acceleration LoRA during inference. CollectionLoRA: Consolidating the acceleration prior and all target concepts into a single unified module through multi-teacher distillation.
Figure 3: The overall framework of CollectionLoRA.
Figure 4: Effectiveness of C2F-DO. (a) Directly applying standard DMD to multi-teacher distillation causes the student distribution to collapse into an intermediate state. (b) Relying solely on trajectory anchoring leads to detail loss in high-frequency features, whereas incorporating target simulation effectively restores realistic microscopic details.
Figure 5: Evaluation of subject consistency metrics.
Figure 6: Qualitative comparison of CollectionLoRA against baseline methods.
Figure 7: Zero-shot effect composition capability of CollectionLoRA.
Figure 8: Qualitative ablation study. The progressive integration of our core components systematically resolves semantic collapse, restores high-frequency textures, and ensures strict structural consistency.
  Accompanying text (Sec. 5.4, Ablation Study, "Quantitative Performance Analysis"): Under a 50-in-1 concurrent distillation setting (Tab. 3), our ablation validates each component: AOP mitigates concept bleeding, reducing BCR from 0.378 to…
Figure 9: Qualitative ablation of training dynamics.
Figure 10: Quantitative ablation of training dynamics.
Figure 11: The prompt template used for querying the Multimodal Large Language Model (MLLM) to evaluate the BCR metric.
Figure 12: The prompt template used for querying the MLLM to quantitatively evaluate the consistency score (1-5) of the generated image.
Figure 13: Comparison between (a) Backward Simulation and (b) our proposed Target Simulation. In the heterogeneous distillation setting, Backward Simulation can yield nearly identical 'real' and 'fake' predictions due to severe domain deviation, leading to the vanishing gradient problem. Conversely, Target Simulation provides distinct representations, ensuring informative gradients for the student model.
Figure 14: Comparison of simulation strategies. (a) Backward simulation leads to vanishing gradients. (b) Target simulation enables differentiation. (c) Time-step constraints further amplify the discrepancy, providing robust gradient signals for effective training.
  Accompanying text (Sec. 8.2, Ablation Study on Timestep-Constrained Target Simulation): Upon the introduction of Target Simulation, we further incorporate a time-step constraint. …
Figure 15: The prompt template used for refining and enriching the generic editing prompt based on visual samples and a baseline description.
Figure 16: Detailed user study results. Evaluators were asked to choose the best result among four candidates across three dimensions: Visual Quality, Consistency, and Style Alignment. Our proposed method, 50 in 1 (Ours), consistently achieves the highest preference in all categories. Most notably, it secures 66.2% of the votes for Consistency and 53.9% for Style Alignment, significantly outperforming the Base model…
  Chart values as extracted: Visual Quality 35.0%, 8.7%, 6.4%, 49.9%; Consistency 25.4%, 7.0%, 1.4%, 66.2%; Style Alignment 41.4%, 3.3%, 1.4%, 53.9%. Legend: Base, Base + Lightning, 50 in 1 (FM) + Lightning, 50 in 1 (Ours).
Figure 17: Qualitative Evaluation
Figure 18: Qualitative Evaluation
Figure 19: Qualitative Evaluation
Figure 20: Qualitative Evaluation
Figure 21: Qualitative Evaluation
Figure 22: Qualitative Evaluation

Original abstract

Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via Low-Rank Adaptation (LoRA). As the number of desired effects grows, storing and dynamically loading numerous these effect LoRAs significantly increases deployment overhead. Furthermore, current pipelines typically cascade these effect LoRAs with acceleration modules for fast generation, which triggers severe parameter interference and results in concept bleeding and style degradation. We propose CollectionLoRA, a multi-teacher on-policy distillation framework capable of distilling the concepts of up to 50 different effect LoRAs along with few-step generation capabilities into a single LoRA. This fundamentally resolves the feature interference issue and significantly reduces deployment costs. Specifically, the method introduces (i) a Probabilistic Dual-Stream Routing mechanism that enables the model to randomly switch between data sources during training, effectively enhancing its generalization in unseen scenarios; (ii) an Asymmetric Orthogonal Prompting strategy to achieve concept isolation within the prompt space; (iii) a Coarse-to-Fine Distillation Objective to mitigate the distribution gap between the teacher and student models. Extensive evaluations show that CollectionLoRA distills all customized effects and few-step generation into a single LoRA, reducing deployment overhead while achieving concept fidelity comparable to or better than independently trained teacher models. Code: https://github.com/Qwen-Applications/CollectionLoRA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes CollectionLoRA, a multi-teacher on-policy distillation method that consolidates concepts from up to 50 effect-specific LoRAs plus few-step generation into one LoRA for customized image editing in diffusion models. It introduces Probabilistic Dual-Stream Routing to improve generalization via random data-source switching, Asymmetric Orthogonal Prompting for prompt-space concept isolation, and a Coarse-to-Fine Distillation Objective to reduce teacher-student distribution gaps. The central claim is that these components together eliminate feature interference from cascaded LoRAs, achieve fidelity comparable or superior to separate teacher models, and reduce deployment overhead.

Significance. If the empirical support holds, the work would provide a concrete engineering route to scaling multi-effect customization without multiplicative storage or interference costs, which is a practical bottleneck in LoRA-based diffusion pipelines. The on-policy multi-teacher framing and the three listed mechanisms constitute a targeted contribution to parameter-efficient adaptation at scale.

major comments (2)
  1. [Abstract] Abstract: the assertion that 'extensive evaluations show' comparable or better fidelity and successful distillation of 50 effects supplies no metrics, baselines, dataset details, ablation tables, or quantitative results, leaving the central claim without visible load-bearing evidence.
  2. [Method] Method (components i–iii): the claim that Probabilistic Dual-Stream Routing, Asymmetric Orthogonal Prompting, and Coarse-to-Fine Distillation Objective are jointly sufficient to isolate 50 concepts and close the teacher-student gap without post-hoc filtering or hyperparameter regimes that mask interference is asserted but not demonstrated by any reported ablation or scaling experiment at the target scale.
minor comments (1)
  1. [Abstract] Abstract: 'numerous these effect LoRAs' is grammatically awkward and should be rephrased.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will incorporate.

point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that 'extensive evaluations show' comparable or better fidelity and successful distillation of 50 effects supplies no metrics, baselines, dataset details, ablation tables, or quantitative results, leaving the central claim without visible load-bearing evidence.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative support. The full manuscript contains detailed metrics, baselines, dataset descriptions, and ablation tables in the Experiments section. We will revise the abstract to incorporate key quantitative results (e.g., fidelity metrics and comparisons to teacher models) while preserving its concise nature. revision: yes

  2. Referee: [Method] Method (components i–iii): the claim that Probabilistic Dual-Stream Routing, Asymmetric Orthogonal Prompting, and Coarse-to-Fine Distillation Objective are jointly sufficient to isolate 50 concepts and close the teacher-student gap without post-hoc filtering or hyperparameter regimes that mask interference is asserted but not demonstrated by any reported ablation or scaling experiment at the target scale.

    Authors: The manuscript reports ablation studies isolating the contribution of each component and scaling results up to 50 effects. However, we acknowledge that more explicit joint ablations and scaling curves at the exact 50-effect target, with controls for post-hoc filtering, would better demonstrate sufficiency. We will add these expanded experiments and analyses in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No circularity: framework claims rest on independent components and evaluations, not self-definition or fitted inputs

full rationale

The paper introduces CollectionLoRA as a new multi-teacher on-policy distillation method with three explicitly named components (Probabilistic Dual-Stream Routing, Asymmetric Orthogonal Prompting, Coarse-to-Fine Distillation Objective). No equations appear in the abstract or description, no parameters are described as fitted then relabeled as predictions, and no self-citations or uniqueness theorems are invoked to justify the central claims. The performance assertions are tied to 'extensive evaluations' rather than reducing by construction to the inputs or prior author work. This satisfies the criteria for a self-contained engineering proposal without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described beyond standard LoRA and distillation machinery.

pith-pipeline@v0.9.1-grok · 5813 in / 1094 out tokens · 30284 ms · 2026-06-29T22:32:12.668534+00:00 · methodology

