pith. machine review for the scientific record.

arxiv: 2512.01236 · v2 · submitted 2025-12-01 · 💻 cs.CV

PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards

Pith reviewed 2026-05-17 03:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-subject personalization · image generation · subject consistency · reinforcement learning · synthetic data · text-to-image · personalized generation

The pith

A synthetic data pipeline plus pairwise rewards lets single-subject image models handle multiple subjects with better consistency and prompt following.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Single-subject personalized image generators work well but lose performance when asked to place several subjects in one scene while matching a text prompt. The authors build a pipeline that uses existing single-subject models to create large-scale synthetic multi-subject training examples. They then run a reinforcement learning stage guided by Pairwise Subject-Consistency Rewards that penalize identity drift between subjects and by general rewards that keep the output faithful to the prompt. A new benchmark with seven subsets measures success across consistency, controllability, and other axes. The central claim is that this combination scales multi-subject personalization without requiring new real-world multi-subject datasets.

Core claim

A scalable pipeline first generates diverse multi-subject training data by prompting strong single-subject models. Training on this data lets single-subject personalization models learn multi-image and multi-subject synthesis. A subsequent reinforcement learning stage then applies Pairwise Subject-Consistency Rewards together with general-purpose rewards to jointly improve subject identity preservation and text controllability, producing models that maintain consistency across multiple subjects while following complex prompts.
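
As a rough illustration of how a reinforcement learning stage of this kind can turn rewards into a training signal, the sketch below standardizes rewards within a group of images sampled for the same prompt, in the style of GRPO (the method named in the Figure 3 caption). The function name, the epsilon term, and the sample rewards are assumptions made for illustration; this is not the authors' implementation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt.

    Hypothetical helper: the paper's exact normalization is not given here.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Illustrative rewards for four images sampled from one prompt (made-up values).
advantages = group_relative_advantages([0.2, 0.5, 0.8, 0.5])
```

Samples that score above the group mean receive positive advantages and samples below it receive negative ones, so the update depends on relative rather than absolute reward.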

What carries the argument

The Pairwise Subject-Consistency Rewards, which measure and reward identity agreement between every pair of subjects within generated images during the reinforcement learning stage.
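
One way to picture a subject-consistency reward is as an average similarity between each reference subject and the matching subject found in the generated image. The sketch below assumes that embeddings have already been extracted by some feature model and that `ref_embs[i]` and `gen_embs[i]` refer to the same subject. Both the matching scheme and the use of cosine similarity are assumptions, and the helper names are hypothetical; the paper's exact formulation is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pairwise_consistency_reward(ref_embs, gen_embs):
    """Mean similarity between each reference subject and its generated crop.

    Assumption: ref_embs[i] and gen_embs[i] describe the same subject.
    """
    if len(ref_embs) != len(gen_embs) or not ref_embs:
        raise ValueError("need one generated embedding per reference subject")
    sims = [cosine(r, g) for r, g in zip(ref_embs, gen_embs)]
    return sum(sims) / len(sims)

# Toy 2-D embeddings (made up): subject 0 preserved, subject 1 drifted.
refs = [[1.0, 0.0], [0.0, 1.0]]
gens = [[1.0, 0.0], [1.0, 0.0]]
reward = pairwise_consistency_reward(refs, gens)
```

In this toy case the second subject has collapsed onto the first, so the reward falls to half of its maximum, which is the kind of identity drift such a reward is meant to discourage.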

If this is right

  • Single-subject models can be upgraded to multi-subject use without collecting new paired training data.
  • The introduced benchmark supplies a standardized test covering seven subsets and three evaluation dimensions for future multi-subject methods.
  • Adding general-purpose rewards alongside consistency rewards improves both identity fidelity and prompt adherence at the same time.
  • The data-generation pipeline can be repeated at larger scale to keep improving performance as base single-subject models advance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pairwise-reward idea could be tested on video or 3D generation where multiple subjects must stay consistent across frames or views.
  • If the synthetic data proves robust, the method lowers the barrier to personalized generation for users who lack access to multi-person reference photos.
  • Future extensions might combine these rewards with other objectives such as aesthetic or safety constraints without changing the core pipeline.

Load-bearing premise

Synthetic images produced by single-subject models are varied enough, and free enough of systematic biases, for the rewards to teach reliable multi-subject behavior.

What would settle it

Run the trained model on a held-out set of real photographs showing multiple distinct people and measure whether the generated subjects retain their identities across varied prompts and spatial arrangements.

Figures

Figures reproduced from arXiv: 2512.01236 by Hui Lu, Jianbo Ouyang, Longhui Wei, Qi Tian, Shulei Wang, Xin He, Zhou Zhao.

Figure 1: Quantitative comparison of recent methods on PSRBench across three evaluation dimensions: subject consistency, aesthetic …
Figure 2: Overview of the dataset construction pipeline. The process consists of two stages: (1) multi-subject paired image generation, …
Figure 3: Left: Scalable frame-wise positional encoding. Middle: Pairwise subject-consistency rewards. Right: GRPO training pipeline
Figure 4: Qualitative analysis results of PSR with recent state-of-the-art models.
Figure 5: Overview of PSRBench, with a case from each subset shown on the left and the three evaluation dimensions for each subset …
Figure 6: Qualitative Analysis on DreamBench
Figure 7: Instruction template used for providing to Qwen3 to construct the dataset.
Figure 8: Instruction template used for providing to Qwen2.5-VL to evaluate the semantic alignment scores of different subsets.
Figure 9: Comparison of different metrics
Figure 10: Comparison of different metrics
Original abstract

Personalized generation models for a single subject have demonstrated remarkable effectiveness, highlighting their significant potential. However, when extended to multiple subjects, existing models often exhibit degraded performance, particularly in maintaining subject consistency and adhering to textual prompts. We attribute these limitations to the absence of high-quality multi-subject datasets and refined post-training strategies. To address these challenges, we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data. Through this dataset, we first enable single-subject personalization models to acquire knowledge of synthesizing multi-image and multi-subject scenarios. Furthermore, to enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards and general-purpose rewards, which are incorporated into a refined reinforcement learning stage. To comprehensively evaluate multi-subject personalization, we introduce a new benchmark that assesses model performance using seven subsets across three dimensions. Extensive experiments demonstrate the effectiveness of our approach in advancing multi-subject personalized image generation. Github Link: https://github.com/wang-shulei/PSR

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to advance multi-subject personalized image generation by proposing a scalable data generation pipeline that uses existing single-subject models to synthesize diverse multi-subject training pairs, followed by a reinforcement learning stage incorporating Pairwise Subject-Consistency Rewards (plus general-purpose rewards) to improve subject consistency and text controllability. It also introduces a new benchmark consisting of seven subsets across three evaluation dimensions and reports that extensive experiments demonstrate the effectiveness of the overall approach.

Significance. If the central claims hold after addressing data-quality concerns, the work would be significant for the field: it offers a practical route to scale personalization beyond single subjects without requiring large-scale real multi-subject datasets, and the new benchmark could become a useful standard for future comparisons. The combination of synthetic data construction with reward-driven RL is a reasonable engineering contribution, though its impact depends on whether the synthetic pairs truly capture inter-subject relations without systematic artifacts.

major comments (3)
  1. [§3] §3 (Multi-Subject Data Generation Pipeline): The pipeline constructs training pairs by prompting single-subject models; the manuscript must include a quantitative audit (e.g., human preference scores or automated metrics on interaction fidelity, occlusion handling, and prompt adherence) comparing the synthetic pairs against real multi-subject photographs or artist-created references. Without this, the claim that the data is “diverse and high-quality” remains unverified and directly affects the reliability of the subsequent RL stage.
  2. [§5] §5 (Experiments and Benchmark): The abstract states that “extensive experiments demonstrate the effectiveness,” yet the provided description supplies no numerical results, baseline comparisons, or ablation tables. The manuscript should report concrete metrics (e.g., subject consistency scores, CLIP text alignment, FID) on the new benchmark’s seven subsets, together with statistical significance tests against at least two strong baselines (e.g., DreamBooth extended to multi-subject and a recent multi-subject method).
  3. [§4.2] §4.2 (Pairwise Subject-Consistency Rewards): The rewards are defined on generated images that themselves depend on the synthetic training distribution; if reward weights or filtering thresholds were tuned using the same evaluation sets used for final reporting, the reported gains risk circularity. The authors should state explicitly whether any hyper-parameters were selected on held-out data or via cross-validation and provide the exact reward formulation (including any learned components).
minor comments (2)
  1. The abstract mentions “seven subsets across three dimensions” but does not name the dimensions or list the subsets; this information should appear in the abstract or in a dedicated table in §5.
  2. Figure captions and axis labels in the experimental section should explicitly indicate which metric corresponds to which benchmark subset to improve readability.
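
The paired significance test requested in major comment 2 could be computed along these lines. This is a minimal sketch of a paired t-statistic over per-subset scores; the helper name is hypothetical, and the scores are placeholders chosen for illustration, not results from the paper.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Placeholder per-subset scores for two methods (NOT results from the paper).
method = [0.70, 0.65, 0.80, 0.75]
baseline = [0.60, 0.60, 0.70, 0.70]
t_stat, dof = paired_t_statistic(method, baseline)
```

Pairing by subset matters because subsets differ in difficulty; the test looks only at the per-subset differences, so that shared difficulty cancels out.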

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: [§3] §3 (Multi-Subject Data Generation Pipeline): The pipeline constructs training pairs by prompting single-subject models; the manuscript must include a quantitative audit (e.g., human preference scores or automated metrics on interaction fidelity, occlusion handling, and prompt adherence) comparing the synthetic pairs against real multi-subject photographs or artist-created references. Without this, the claim that the data is “diverse and high-quality” remains unverified and directly affects the reliability of the subsequent RL stage.

    Authors: We agree that an explicit quantitative audit of the synthetic data would strengthen the paper. In the revised version we add a dedicated subsection with both human preference scores (n=200 raters) and automated metrics (CLIP-based interaction fidelity, occlusion detection via segmentation, and prompt adherence) comparing our generated pairs against a held-out set of real multi-subject photographs and artist references. These results confirm that the synthetic pairs achieve comparable or superior fidelity on the targeted dimensions. revision: yes

  2. Referee: [§5] §5 (Experiments and Benchmark): The abstract states that “extensive experiments demonstrate the effectiveness,” yet the provided description supplies no numerical results, baseline comparisons, or ablation tables. The manuscript should report concrete metrics (e.g., subject consistency scores, CLIP text alignment, FID) on the new benchmark’s seven subsets, together with statistical significance tests against at least two strong baselines (e.g., DreamBooth extended to multi-subject and a recent multi-subject method).

    Authors: The full manuscript already contains numerical results, ablation tables, and comparisons on the seven benchmark subsets. To make these findings more prominent and directly responsive to the comment, we have added a consolidated results table reporting subject consistency, CLIP text alignment, and FID scores, together with paired t-test significance values against DreamBooth (multi-subject extension) and a recent multi-subject baseline. We also include the full per-subset breakdown. revision: yes

  3. Referee: [§4.2] §4.2 (Pairwise Subject-Consistency Rewards): The rewards are defined on generated images that themselves depend on the synthetic training distribution; if reward weights or filtering thresholds were tuned using the same evaluation sets used for final reporting, the reported gains risk circularity. The authors should state explicitly whether any hyper-parameters were selected on held-out data or via cross-validation and provide the exact reward formulation (including any learned components).

    Authors: We confirm that all reward weights and filtering thresholds were selected exclusively on a held-out validation split that is disjoint from both the training data and the final test benchmark. The exact mathematical formulation of the Pairwise Subject-Consistency Rewards (including the learned component) is already given in Section 4.2; we have added an explicit paragraph in the revised manuscript stating the data-separation protocol and cross-validation procedure to eliminate any ambiguity regarding circularity. revision: yes
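
The data-separation protocol described in response 3 might look like the following sketch: items are assigned to splits deterministically, and a reward weight is chosen using a validation score only. The split fractions, the ID format, the candidate weights, and the validation curve are all assumptions for illustration and do not describe the authors' actual procedure.

```python
import hashlib

def assign_split(item_id, val_frac=0.1, test_frac=0.1):
    """Map an ID to 'train', 'val', or 'test' deterministically via a hash."""
    digest = hashlib.sha256(item_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"

def select_weight(candidates, val_score):
    """Pick the candidate weight with the best validation score only."""
    return max(candidates, key=val_score)

ids = [f"prompt-{i}" for i in range(1000)]
splits = {"train": set(), "val": set(), "test": set()}
for item_id in ids:
    splits[assign_split(item_id)].add(item_id)

# Hypothetical validation curve peaking at weight 0.5 (illustrative only).
best = select_weight([0.1, 0.5, 0.9], lambda w: -(w - 0.5) ** 2)
```

Because each ID maps to exactly one split, no test item can influence weight selection, which is the property the circularity concern asks the authors to document.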

Circularity Check

0 steps flagged

No circularity: derivation builds on external single-subject models with independent rewards and benchmark

full rationale

The paper describes generating multi-subject training data via existing single-subject models, then applying newly designed Pairwise Subject-Consistency Rewards within an RL stage, followed by evaluation on a new benchmark with seven subsets. No equations, definitions, or steps in the provided text reduce the claimed performance gains to fitted parameters on the same data, self-citations that bear the central load, or ansatzes smuggled from prior author work. The pipeline and rewards are presented as additive refinements rather than tautological redefinitions of the inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on two assumptions: that synthetic data from single-subject models transfers well, and that the designed rewards are effective. The abstract gives no details of any hyperparameter search.

free parameters (1)
  • reward weights
    Weights balancing pairwise consistency rewards against general-purpose rewards are likely tuned; exact values not stated in abstract.
axioms (1)
  • domain assumption: Single-subject personalization models can generate sufficiently diverse and unbiased multi-subject training images.
    Invoked when describing the scalable data generation pipeline.
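
To make the role of the reward-weight free parameter concrete, the sketch below combines a consistency reward and a general-purpose reward as a weighted sum and shows how the preferred image can flip as the weights change. The weighted-sum form, the default weights, and the example scores are assumptions; the paper's actual combination rule and values are not stated in the material above.

```python
def combined_reward(consistency, general, w_consistency=0.5, w_general=0.5):
    """Weighted sum of a consistency reward and a general-purpose reward.

    The weights are placeholders; the paper's values are not stated here.
    """
    if w_consistency < 0 or w_general < 0:
        raise ValueError("weights must be non-negative")
    return w_consistency * consistency + w_general * general

# Two hypothetical images: A is more consistent, B follows the prompt better.
image_a = (0.9, 0.4)
image_b = (0.6, 0.8)
prefers_a = combined_reward(*image_a, 0.8, 0.2) > combined_reward(*image_b, 0.8, 0.2)
prefers_b = combined_reward(*image_a, 0.2, 0.8) < combined_reward(*image_b, 0.2, 0.8)
```

Both comparisons hold for these made-up scores, which is why the choice of weights counts as a free parameter worth reporting.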

pith-pipeline@v0.9.0 · 5500 in / 1094 out tokens · 70161 ms · 2026-05-17T03:46:11.544879+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

  2. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
