IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

Changhao Qiao; Chao Hui; Haohua Chen; Honghao Cai; Jing Li; Runqi Wang; Sijie Xu; Tianze Zhou; Wei Zhu; Xiangyuan Wang

arxiv: 2603.00607 · v2 · pith:C26XQVQ6new · submitted 2026-02-28 · 💻 cs.CV · cs.AI

Honghao Cai, Xiangyuan Wang, Jing Li, Yunhao Bai, Tianze Zhou, Haohua Chen, Chao Hui, Changhao Qiao, Runqi Wang, Sijie Xu, Yuyang Hao, Zezhou Cui, Yuyuan Yang, Wei Zhu, Yibo Chen, Xu Tang, Yao Hu, Zhen Li

Pith reviewed 2026-05-21 12:50 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords multi-subject image generation · identity preservation · stability-plasticity dilemma · flow matching · direct preference optimization · diffusion models · age transformation · group composition

The pith

IdGlow uses adaptive timestep scheduling and group-level preference optimization to balance identity preservation with natural scene composition in mask-free multi-subject generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the stability-plasticity dilemma: a method must keep reference faces recognizable while still allowing flexible scene changes and deformations. Current approaches rely on rigid spatial masks or localized attention, which break down for tasks such as turning adults into children within a group photo. IdGlow builds a two-stage process on flow matching diffusion models. The first stage is a fine-tuning step with linearly decaying timestep constraints and a gate that injects identity only during a critical semantic window, supported by a vision-language model that writes context-aware prompts derived from bad cases. The second stage applies weighted direct preference optimization at the group level to fix artifacts and improve harmony. If effective, this removes the need for spatial inputs and produces images that look both faithful and aesthetically natural.

Core claim

IdGlow is a mask-free, progressive two-stage framework on Flow Matching diffusion models. The supervised fine-tuning stage introduces linear decay timestep scheduling to relax constraints for natural group composition and a temporal gating mechanism that limits identity injection to a critical semantic window, preserving adult facial semantics without overriding child-like structures. A badcase-driven vision-language model supplies precise prompts to avoid attribute leakage and ambiguity. The second stage uses fine-grained group-level direct preference optimization with weighted margins to remove multi-subject artifacts, improve texture harmony, and align identity fidelity with real-world distributions.
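
To make the two scheduling ideas concrete, here is a minimal Python sketch, not the authors' implementation. It assumes `t` is a normalized timestep in [0, 1], that the linear decay runs over that same variable, and that the gate is a hard on/off window. The default window (0.3, 0.6) is the one quoted from the paper elsewhere on this page; the function names, the task labels, and the decay endpoints are hypothetical.

```python
def linear_decay_weight(t, w_start=1.0, w_end=0.0):
    """Constraint weight that moves linearly from w_start (t=0) to w_end (t=1).

    The endpoints are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return w_start + (w_end - w_start) * t


def temporal_gate(t, window=(0.3, 0.6)):
    """Hard gate: 1.0 inside the window (inclusive), 0.0 outside."""
    lo, hi = window
    return 1.0 if lo <= t <= hi else 0.0


def identity_scale(t, task):
    """Pick an identity-injection strength for a (hypothetical) task label."""
    if task == "age_transform":
        # Identity is injected only inside the critical window.
        return temporal_gate(t)
    if task == "group_fusion":
        # Identity constraint relaxes steadily across the trajectory.
        return linear_decay_weight(t)
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    for t in (0.0, 0.25, 0.45, 0.7, 1.0):
        print(t, identity_scale(t, "group_fusion"), identity_scale(t, "age_transform"))
```

Which end of the [0, 1] range corresponds to pure noise depends on the flow-matching convention in use, so the direction of the decay would need to be checked against the paper before reusing this sketch.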

What carries the argument

Task-adaptive timestep scheduling with linear decay and temporal gating that concentrates identity signals in a critical semantic window, paired with badcase-driven VLM prompt synthesis and weighted-margin group-level direct preference optimization.

If this is right

Complex deformations such as age transformation become feasible while keeping multiple reference identities intact.
Group images can be composed without relying on explicit spatial masks or localized attention mechanisms.
Texture harmony and overall aesthetic quality improve alongside identity fidelity on real-world distributions.
Performance gains appear on both direct multi-person fusion and age-transformed group generation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The scheduling and gating ideas could transfer to other conditional generation settings where preservation must coexist with structural change.
The badcase-driven prompt method might reduce manual prompt engineering in broader image editing or composition tools.
Extending the group-level optimization to video or multi-view synthesis could test whether the same balance holds over time or across viewpoints.

Load-bearing premise

A vision-language model trained on bad cases can reliably generate precise context-aware prompts that fix attribute leakage and semantic ambiguity without any layout or spatial guidance.

What would settle it

Running the model on mixed-age group scenes with adult identities and checking whether child anatomical features remain intact or whether faces blend incorrectly when the VLM prompt step is removed.
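
One way such a check could be scored is sketched below in Python. It assumes each face has already been turned into an embedding vector by some face-recognition model (that step is not shown), matches generated faces to reference identities one-to-one, and reports the mean cosine similarity under the best matching. The function names and the brute-force matching are illustrative choices, not the paper's evaluation protocol, and brute force is only practical for small groups.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matched_identity_score(ref_embs, gen_embs):
    """Mean cosine similarity under the best one-to-one face assignment."""
    if len(ref_embs) != len(gen_embs) or not ref_embs:
        raise ValueError("need the same non-zero number of reference and generated faces")
    n = len(ref_embs)
    sim = [[cosine(r, g) for g in gen_embs] for r in ref_embs]
    # Try every assignment of generated faces to references; keep the best total.
    best = max(sum(sim[i][p[i]] for i in range(n)) for p in permutations(range(n)))
    return best / n


if __name__ == "__main__":
    refs = [[1.0, 0.0], [0.0, 1.0]]          # toy embeddings for two identities
    with_prompts = [[0.0, 1.0], [1.0, 0.0]]  # identities kept, order swapped
    without_prompts = [[1.0, 1.0], [1.0, 1.0]]  # both faces blended together
    print(matched_identity_score(refs, with_prompts))
    print(matched_identity_score(refs, without_prompts))
```

In this toy case the swapped-order output still scores 1.0 because the matching is order-independent, while the blended output scores about 0.71. A drop of that kind when the prompt step is removed is the signal the test above is looking for.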

Figures

Figures reproduced from arXiv: 2603.00607 by Changhao Qiao, Chao Hui, Haohua Chen, Honghao Cai, Jing Li, Runqi Wang, Sijie Xu, Tianze Zhou, Wei Zhu, Xiangyuan Wang, Xu Tang, Yao Hu, Yibo Chen, Yunhao Bai, Yuyang Hao, Yuyuan Yang, Zezhou Cui, Zhen Li.

**Figure 2.** The architecture of IdGlow-DiT. The model processes variable numbers of reference identities through a unified encoding strategy, forming a concatenated multi-ID sequence. A key innovation is the Dynamics-Aware Gating Module (highlighted in orange), which modulates the intensity of the identity sequence based on the diffusion timestep t and the specific task (e.g., age transformation curves). These gated …

**Figure 3.** Task-specific prompt synthesis via the Image-Edit-Prompt model.

**Figure 4.** Dynamics-aware identity modulation tailored to specific generative tasks.

Abstract

Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

IdGlow pairs linear-decay timestep scheduling and temporal gating with a badcase-driven VLM and weighted group-level DPO to tackle mask-free multi-subject harmony, but the abstract supplies no numbers or ablations to show it works.

The main point is that this paper describes a two-stage mask-free pipeline on flow matching models for generating scenes with multiple reference identities. In the SFT stage it uses a linear decay schedule on timesteps plus a temporal gating mechanism to inject identity information only during a critical window, aiming to keep adult facial features while allowing child-like body changes. It adds a VLM that looks at bad cases to write better prompts without any masks or layout inputs, then runs a fine-grained group-level DPO with a weighted margin to reduce artifacts and push outputs toward real distributions. The claim is that this combination beats the stability-plasticity tradeoff on direct fusion and age-transformed group benchmarks.

Referee Report

3 major / 2 minor

Summary. The manuscript presents IdGlow, a mask-free progressive two-stage framework for multi-subject image generation based on Flow Matching diffusion models. The first stage involves supervised fine-tuning with task-adaptive timestep scheduling using a linear decay schedule and a temporal gating mechanism to preserve identity semantics. A badcase-driven Vision-Language Model is used for context-aware prompt synthesis to avoid attribute leakage without spatial inputs. The second stage employs Fine-Grained Group-Level Direct Preference Optimization (DPO) with weighted margin to improve harmony and fidelity. Experiments on multi-person fusion and age-transformed group generation benchmarks claim to achieve a superior balance between facial fidelity and aesthetic quality, mitigating the stability-plasticity conflict.

Significance. If the quantitative results and ablations substantiate the claims, this work could offer a significant advance in mask-free multi-subject generation by addressing the stability-plasticity dilemma through dynamic modulation techniques. The integration of VLM for prompt synthesis and group-level DPO represents an interesting approach to handling complex identity interactions without rigid spatial conditioning. However, the current presentation leaves the empirical support unclear.

major comments (3)

[Abstract] The abstract asserts superior performance on two benchmarks yet supplies no quantitative numbers, error bars, ablation results, or baseline comparisons, leaving the central claim unsupported by visible evidence.
[§3.2 (Prompt Synthesis)] The assumption that a badcase-driven Vision-Language Model can reliably produce precise, context-aware prompts that resolve attribute leakage and semantic ambiguity without any explicit layout or spatial inputs is load-bearing for the mask-free claim. No quantitative evaluation of VLM prompt fidelity for the age-transformation or group-fusion benchmarks is provided.
[§3.3 (DPO Stage)] The Fine-Grained Group-Level DPO optimizes directly against human or model preferences on generated outputs, raising a circularity concern where the reported fidelity gains may be partly defined by the same optimization loop used to produce them.

minor comments (2)

[§3.1 and §3.3] The definitions of the linear decay schedule parameters and the weighted margin in group-level DPO could be clarified with explicit equations to aid reproducibility.
[Related Work] Additional references to recent works on mask-free multi-subject generation or VLM-guided diffusion would strengthen the positioning of the contributions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be incorporated to strengthen the presentation and empirical support.

Referee: [Abstract] The abstract asserts superior performance on two benchmarks yet supplies no quantitative numbers, error bars, ablation results, or baseline comparisons, leaving the central claim unsupported by visible evidence.

Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revised manuscript we will add key metrics from the experiments section, including facial fidelity and aesthetic quality scores with baseline comparisons, to directly evidence the claimed performance. revision: yes
Referee: [§3.2 (Prompt Synthesis)] The assumption that a badcase-driven Vision-Language Model can reliably produce precise, context-aware prompts that resolve attribute leakage and semantic ambiguity without any explicit layout or spatial inputs is load-bearing for the mask-free claim. No quantitative evaluation of VLM prompt fidelity for the age-transformation or group-fusion benchmarks is provided.

Authors: This is a valid observation regarding the centrality of the VLM component. While end-to-end benchmark results demonstrate the practical effectiveness of the synthesized prompts, we will add a quantitative evaluation of prompt fidelity (e.g., semantic similarity and leakage reduction metrics) specifically on the age-transformation and group-fusion benchmarks. revision: yes
Referee: [§3.3 (DPO Stage)] The Fine-Grained Group-Level DPO optimizes directly against human or model preferences on generated outputs, raising a circularity concern where the reported fidelity gains may be partly defined by the same optimization loop used to produce them.

Authors: We appreciate the methodological concern. The preference data is collected from human annotators on a held-out set of SFT-stage outputs, and final reporting relies on independent objective metrics and test splits. We will revise §3.3 to explicitly detail this separation between preference collection and evaluation protocols. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained with independent methodological contributions

The paper presents a progressive two-stage framework consisting of supervised fine-tuning with task-adaptive timestep scheduling and temporal gating, integrated with badcase-driven VLM prompt synthesis to address attribute leakage, followed by Fine-Grained Group-Level DPO with weighted margin formulation. These steps are described as targeting distinct facets of the stability-plasticity dilemma in mask-free multi-subject generation. Claims of superior Pareto balance are supported by experiments on the direct multi-person fusion and age-transformed group generation benchmarks, without any reduction of results to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The central derivation remains externally falsifiable via the reported benchmark outcomes rather than tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The framework rests on several domain assumptions about diffusion dynamics and VLM capabilities plus newly introduced mechanisms whose effectiveness is asserted rather than independently verified.

free parameters (2)

linear decay schedule parameters
The rate and breakpoints of the linear decay used for timestep scheduling are chosen to align with generative dynamics and are therefore fitted or hand-tuned.
weighted margin in group-level DPO
The margin weighting in the preference optimization objective is a tunable hyper-parameter that directly influences the final identity and harmony scores.
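
To see how the margin weighting could enter the objective, consider a minimal Python sketch of a DPO-style loss with an additive margin. It is an assumed form, not the paper's equation: the loss is `-log sigmoid(beta * (logratio_win - logratio_lose) - margin)`, where each log-ratio is the policy-minus-reference log-likelihood of a sample, and the margin is a weighted sum of per-criterion score gaps between the preferred and rejected images. The criterion names, weights, and `beta` value are hypothetical.

```python
import math


def weighted_margin(score_gaps, weights):
    """Weighted sum of per-criterion quality gaps (preferred minus rejected)."""
    return sum(weights[k] * score_gaps[k] for k in weights)


def margin_dpo_loss(logratio_win, logratio_lose, margin, beta=0.1):
    """-log sigmoid(beta * (logratio_win - logratio_lose) - margin).

    Computed as softplus(-z) in a numerically stable form.
    """
    z = beta * (logratio_win - logratio_lose) - margin
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Hypothetical criteria and weights for one preference pair.
    gaps = {"identity": 0.2, "harmony": 0.1}
    weights = {"identity": 2.0, "harmony": 1.0}
    m = weighted_margin(gaps, weights)
    print(m, margin_dpo_loss(logratio_win=1.0, logratio_lose=0.0, margin=m))
```

A larger margin demands a larger gap between preferred and rejected samples before the loss flattens out, which is why the weighting counts as a free parameter that can move the reported identity and harmony scores.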

axioms (2)

domain assumption: Flow Matching diffusion models admit task-adaptive timestep scheduling that progressively relaxes identity constraints without destroying earlier semantic structure.
Invoked to justify the supervised fine-tuning stage.
ad hoc to paper: A vision-language model driven by bad-case examples can synthesize prompts that eliminate attribute leakage without spatial layout information.
Central to the prompt-synthesis component.

invented entities (2)

temporal gating mechanism (no independent evidence)
purpose: Concentrate identity injection inside a critical semantic window during diffusion
New component introduced to preserve adult facial semantics while permitting child-like anatomy.
Fine-Grained Group-Level Direct Preference Optimization (no independent evidence)
purpose: Simultaneously remove multi-subject artifacts, improve texture harmony, and recalibrate identity fidelity
Custom DPO variant presented as the second training stage.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Dynamics-Aware Identity Modulation Strategy... temporal gating mechanism that concentrates identity injection within a critical semantic window t∈[0.3,0.6]

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: Task-Adaptive Loss Annealing... linear decay schedule... Fine-Grained Group-Level Direct Preference Optimization

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

ACM Transactions on Graphics (TOG)40(4), 1–12 (2021) 2, 4

Alaluf, Y., Patashnik, O., Cohen-Or, D.: Only a matter of style: Age transformation using a style-based regression model. ACM Transactions on Graphics (TOG)40(4), 1–12 (2021) 2, 4

work page 2021

[2] [2]

Training Diffusion Models with Reinforcement Learning

Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301 (2024) 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

HunyuanImage 3.0 Technical Report

Cao, S., Chen, H., Chen, P., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025) 10

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: Additive angular margin loss for deep face recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4690–4699 (2019) 2, 10

work page 2019

[5] [5]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206 (2024) 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

In: International Conference on Learning Representations (ICLR) (2023) 2, 3

Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An im- age is worth one word: Personalizing text-to-image generation using textual inversion. In: International Conference on Learning Representations (ICLR) (2023) 2, 3

work page 2023

[7] [7]

Seedream 3.0 Technical Report

Gao, Y., Gong, L., Guo, Q., Hou, X., Lai, Z., Li, F., Li, L., Lian, X., Liao, C., Liu, L., Liu, W., et al.: Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346 (2025) 10

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Pulid: Pure and lightning id customization via contrastive alignment

Guo, Z., Wu, Y., Chen, Z., Chen, L., He, Q.: PuLID: Pure and lightning ID customization via contrastive alignment. arXiv preprint arXiv:2404.16022 (2024) 2, 3

work page arXiv 2024

[9] [9]

In: Advances in Neural Information Processing Systems (NeurIPS) (2023) 2

Hao, Y., Chi, Z., Dong, L., Wei, F.: Optimizing prompts for text-to-image generation. In: Advances in Neural Information Processing Systems (NeurIPS) (2023) 2

work page 2023

[10] [10]

In: Advances in Neural Information Processing Systems (NeurIPS) (2020) 2, 4

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems (NeurIPS) (2020) 2, 4

work page 2020

[11] [11]

In: International Conference on Learning Representations (ICLR) (2018) 10

Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: International Conference on Learning Representations (ICLR) (2018) 10

work page 2018

[12] [12]

In: International Conference on Learning Representations (ICLR) (2014) 5

Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: International Conference on Learning Representations (ICLR) (2014) 5

work page 2014

[13] [13]

Naval Research Logistics Quarterly 2(1–2), 83–97 (1955) 7

Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1–2), 83–97 (1955) 7

work page 1955

[14]

Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customization of text-to-image diffusion. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1931–1941 (2023) 3

[15]

Li, Y., et al.: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470 (2025) 4

[16]

Li, Z., Cao, M., Wang, X., Qi, Z., Cheng, M.M., Shan, Y.: PhotoMaker: Customizing realistic human photos via stacked ID embedding. arXiv preprint arXiv:2312.04461 (2024) 2, 3

[17]

Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: International Conference on Learning Representations (ICLR) (2023) 5

[18]

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2023) 5

[19]

Mo, W., Zhang, T., Bai, Y., Su, B., Wen, J.R., Yang, Q.: Dynamic prompt optimizing for text-to-image generation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024) 2

[20]

Or-El, R., Sengupta, S., Quispe, J., et al.: Lifespan age transformation synthesis. In: European Conference on Computer Vision (ECCV). pp. 739–755 (2020) 2, 4

[21]

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing Systems (NeurIPS) (2022) 4

[22]

Papantoniou, F.P., Lattas, A., Moschoglou, S., Deng, J., Kainz, B., Zafeiriou, S.: Arc2Face: A foundation model for ID-consistent human faces. In: European Conference on Computer Vision (ECCV) (2024) 2

[23]

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4172–4182 (2023) 5

[24]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agrawal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML) (2021) 10

[25]

Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. In: Advances in Neural Information Processing Systems (NeurIPS) (2023) 3, 4, 8

[26]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684–10695 (2022) 2, 4

[27]

Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242 (2023) 2, 3

[28]

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems (NeurIPS) (2022) 8, 10

[29]

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations (ICLR) (2021) 2, 4

[30]

Wallace, B., Dang, M., Rafailov, R., Zhou, L., Lou, A., Purushwalkam, S., Ermon, S., Xiong, C., Joty, S., Naik, N.: Diffusion model alignment using direct preference optimization. arXiv preprint arXiv:2311.12908 (2024) 3, 4, 8, 9

[31]

Wang, Q., Bai, X., Wang, H., Qin, Z., Chen, A.: InstantID: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519 (2024) 2, 3, 10

[32]

Wang, X., Huang, Q., et al.: MS-Diffusion: Multi-subject zero-shot image personalization with layout guidance. In: International Conference on Learning Representations (ICLR) (2025) 3

[33]

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025) 10

[34]

Xiao, G., Yin, T., Freeman, W.T., Durand, F., Han, S.: FastComposer: Tuning-free multi-subject image generation with localized attention. International Journal of Computer Vision (IJCV) (2024) 3, 10, 11

[35]

Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023) 2, 3

[36]

Zhou, Y., Zhou, D., Cheng, M., Feng, J., Hou, Q.: StoryDiffusion: Consistent self-attention for long-range image and video generation. In: Advances in Neural Information Processing Systems (NeurIPS) (2024) 3