arxiv: 2512.01236 · v2 · submitted 2025-12-01 · 💻 cs.CV

PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards

Shulei Wang, Longhui Wei, Xin He, Jianbo Ouyang, Hui Lu, Zhou Zhao, Qi Tian

Pith reviewed 2026-05-17 03:46 UTC · model grok-4.3

keywords multi-subject personalization · image generation · subject consistency · reinforcement learning · synthetic data · text-to-image · personalized generation

The pith

A synthetic data pipeline plus pairwise rewards lets single-subject image models handle multiple subjects with better consistency and prompt following.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Single-subject personalized image generators work well but lose performance when asked to place several subjects in one scene while matching a text prompt. The authors build a pipeline that uses existing single-subject models to create large-scale synthetic multi-subject training examples. They then run a reinforcement learning stage guided by Pairwise Subject-Consistency Rewards that penalize identity drift between subjects and by general rewards that keep the output faithful to the prompt. A new benchmark with seven subsets measures success across consistency, controllability, and other axes. The central claim is that this combination scales multi-subject personalization without requiring new real-world multi-subject datasets.

Core claim

A scalable pipeline first generates diverse multi-subject training data by prompting strong single-subject models, allowing those models to learn multi-image and multi-subject synthesis. A subsequent reinforcement learning stage then applies Pairwise Subject-Consistency Rewards together with general-purpose rewards to jointly improve subject identity preservation and text controllability, producing models that maintain consistency across multiple subjects while following complex prompts.
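The Figure 3 caption on this page names GRPO as the training pipeline for the reinforcement learning stage. As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several images sampled for one prompt against that group's mean and standard deviation. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are illustrative assumptions, not details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's sample group (GRPO-style).

    `rewards` holds one scalar reward per sampled image for the same prompt.
    `eps` is an assumed stabilizer for groups with near-identical rewards.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four samples of a single prompt (placeholders).
group = [0.62, 0.71, 0.55, 0.80]
advantages = group_relative_advantages(group)
```

Samples that score above the group mean receive positive advantages and the rest receive negative ones, so no separate learned value function is needed to provide a baseline.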

What carries the argument

The Pairwise Subject-Consistency Rewards, which measure and reward identity agreement between every pair of subjects within generated images during the reinforcement learning stage.
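A passage quoted further down this page writes the reward as R_PSR = 1/N sum f(I_i_dec, I_i_gt), that is, an average over N subjects of a similarity f between two per-subject images. The sketch below shows one way such an average could be computed, assuming each image has already been mapped to an embedding vector and that f is cosine similarity. Both of those choices, and every name in the code, are assumptions made for illustration; the paper's actual f and its subject-extraction step are not described on this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def subject_consistency_reward(generated, references, f=cosine):
    """Average f over matched (generated subject, reference subject) pairs.

    `generated[i]` and `references[i]` are assumed to be embeddings of the
    i-th subject in the output image and of its reference image.
    """
    if not generated or len(generated) != len(references):
        raise ValueError("need one reference per generated subject")
    total = sum(f(g, r) for g, r in zip(generated, references))
    return total / len(generated)

# Toy 2-D embeddings: the first subject matches its reference exactly,
# the second is orthogonal to its reference.
reward = subject_consistency_reward([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [1.0, 0.0]])
```

Averaging per subject means one badly preserved identity lowers the reward even when the other subjects are reproduced well, which is the behavior a multi-subject consistency signal would presumably want.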

If this is right

Single-subject models can be upgraded to multi-subject use without collecting new paired training data.
The introduced benchmark supplies a standardized test covering seven subsets and three evaluation dimensions for future multi-subject methods.
Adding general-purpose rewards alongside consistency rewards improves both identity fidelity and prompt adherence at the same time.
The data-generation pipeline can be repeated at larger scale to keep improving performance as base single-subject models advance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pairwise-reward idea could be tested on video or 3D generation where multiple subjects must stay consistent across frames or views.
If the synthetic data proves robust, the method lowers the barrier to personalized generation for users who lack access to multi-person reference photos.
Future extensions might combine these rewards with other objectives such as aesthetic or safety constraints without changing the core pipeline.

Load-bearing premise

Synthetic images produced by single-subject models contain enough variety and lack systematic biases that would prevent the rewards from teaching reliable multi-subject behavior.

What would settle it

Run the trained model on a held-out set of real photographs showing multiple distinct people and measure whether the generated subjects retain their identities across varied prompts and spatial arrangements.

Figures

Figures reproduced from arXiv: 2512.01236 by Hui Lu, Jianbo Ouyang, Longhui Wei, Qi Tian, Shulei Wang, Xin He, Zhou Zhao.

**Figure 1.** Quantitative comparison of recent methods on PSRBench across three evaluation dimensions: subject consistency, aesthetic […]

**Figure 2.** Overview of the dataset construction pipeline. The process consists of two stages: (1) multi-subject paired image generation, […]

**Figure 3.** Left: Scalable frame-wise positional encoding. Middle: Pairwise subject-consistency rewards. Right: GRPO training pipeline

**Figure 4.** Qualitative analysis results of PSR with recent state-of-the-art models.

**Figure 5.** Overview of PSRBench, with a case from each subset shown on the left and the three evaluation dimensions for each subset […]

**Figure 6.** Qualitative Analysis on DreamBench

**Figure 7.** Instruction template used for providing to Qwen3 to construct the dataset.

**Figure 8.** Instruction template used for providing to Qwen2.5-VL to evaluate the semantic alignment scores of different subsets.

**Figure 9.** Comparison of different metrics

**Figure 10.** Comparison of different metrics

Original abstract

Personalized generation models for a single subject have demonstrated remarkable effectiveness, highlighting their significant potential. However, when extended to multiple subjects, existing models often exhibit degraded performance, particularly in maintaining subject consistency and adhering to textual prompts. We attribute these limitations to the absence of high-quality multi-subject datasets and refined post-training strategies. To address these challenges, we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data. Through this dataset, we first enable single-subject personalization models to acquire knowledge of synthesizing multi-image and multi-subject scenarios. Furthermore, to enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards and general-purpose rewards, which are incorporated into a refined reinforcement learning stage. To comprehensively evaluate multi-subject personalization, we introduce a new benchmark that assesses model performance using seven subsets across three dimensions. Extensive experiments demonstrate the effectiveness of our approach in advancing multi-subject personalized image generation. Github Link: https://github.com/wang-shulei/PSR

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper gives a practical pipeline for turning single-subject generators into multi-subject training data plus pairwise rewards in RL, but the gains are hard to judge without the actual numbers or ablations.

The core contribution here is a data pipeline that prompts single-subject models to create multi-subject image pairs, then feeds those pairs into an RL stage with pairwise subject-consistency rewards and some general rewards. The authors also release a new benchmark with seven subsets covering consistency, prompt adherence, and related axes. That combination has not been laid out in exactly this way before, even though it builds on prior single-subject personalization and RLHF work for diffusion models. The GitHub release is a plus for anyone who wants to inspect the implementation directly.

The approach is straightforward and targets a real pain point: single-subject models degrade when asked to handle several subjects in the same scene. The pipeline and reward design give a concrete way to scale without starting from scratch on new multi-subject data collection. That part is useful for applied work.

The main weakness is that the abstract and available summary contain no quantitative metrics, no baseline comparisons, and no ablation results. Claims about effectiveness therefore rest on results that are not shown here, which makes it difficult to separate the effect of the new rewards from the effect of simply having more training pairs. The stress-test concern that upstream single-subject biases carry over into interaction handling and composition looks plausible from the description: if the synthetic pairs systematically under-represent occlusions or spatial relations, the rewards could reinforce those patterns rather than fix them. A mild circularity risk also exists if reward weights or filters were tuned with any overlap with the evaluation sets.

This paper is aimed at people already working on personalized image generation who need to move from one subject to several. Readers focused on data synthesis pipelines or reward design in generative RL would get the most out of the method details and the benchmark construction.

It is coherent enough on its own terms to deserve a serious referee, mainly because the benchmark could be adopted by others and the pipeline is simple enough to test. I would send it to review but would ask the authors for the full metric tables, ablations on the reward components, and any checks on the quality of the synthetic pairs before accepting.

Referee Report

3 major / 2 minor

Summary. The paper claims to advance multi-subject personalized image generation by proposing a scalable data generation pipeline that uses existing single-subject models to synthesize diverse multi-subject training pairs, followed by a reinforcement learning stage incorporating Pairwise Subject-Consistency Rewards (plus general-purpose rewards) to improve subject consistency and text controllability. It also introduces a new benchmark consisting of seven subsets across three evaluation dimensions and reports that extensive experiments demonstrate the effectiveness of the overall approach.

Significance. If the central claims hold after addressing data-quality concerns, the work would be significant for the field: it offers a practical route to scale personalization beyond single subjects without requiring large-scale real multi-subject datasets, and the new benchmark could become a useful standard for future comparisons. The combination of synthetic data construction with reward-driven RL is a reasonable engineering contribution, though its impact depends on whether the synthetic pairs truly capture inter-subject relations without systematic artifacts.

major comments (3)

§3 (Multi-Subject Data Generation Pipeline): The pipeline constructs training pairs by prompting single-subject models; the manuscript must include a quantitative audit (e.g., human preference scores or automated metrics on interaction fidelity, occlusion handling, and prompt adherence) comparing the synthetic pairs against real multi-subject photographs or artist-created references. Without this, the claim that the data is “diverse and high-quality” remains unverified and directly affects the reliability of the subsequent RL stage.
§5 (Experiments and Benchmark): The abstract states that “extensive experiments demonstrate the effectiveness,” yet the provided description supplies no numerical results, baseline comparisons, or ablation tables. The manuscript should report concrete metrics (e.g., subject consistency scores, CLIP text alignment, FID) on the new benchmark’s seven subsets, together with statistical significance tests against at least two strong baselines (e.g., DreamBooth extended to multi-subject and a recent multi-subject method).
§4.2 (Pairwise Subject-Consistency Rewards): The rewards are defined on generated images that themselves depend on the synthetic training distribution; if reward weights or filtering thresholds were tuned using the same evaluation sets used for final reporting, the reported gains risk circularity. The authors should state explicitly whether any hyper-parameters were selected on held-out data or via cross-validation and provide the exact reward formulation (including any learned components).
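The second major comment asks for statistical significance tests against baselines. One common choice for matched scores is a paired t-test; the sketch below computes only the t statistic and degrees of freedom from two equal-length score lists, using the standard library. The function name is hypothetical, and the scores are made-up placeholders used to exercise the code, not results from the paper.

```python
import math
import statistics

def paired_t_statistic(ours, baseline):
    """Return (t, degrees_of_freedom) for matched score lists.

    Each position is assumed to hold scores for the same prompt or subset
    under the two methods being compared.
    """
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need at least two matched score pairs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("differences are constant; t is undefined")
    t = statistics.fmean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Placeholder scores for four matched items (not real measurements).
t_value, dof = paired_t_statistic([0.70, 0.72, 0.68, 0.75],
                                  [0.65, 0.70, 0.66, 0.70])
```

The resulting t value would still have to be compared against a t distribution with `dof` degrees of freedom to obtain a p-value, for example with a statistics library.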

minor comments (2)

The abstract mentions “seven subsets across three dimensions” but does not name the dimensions or list the subsets; this information should appear in the abstract or in a dedicated table in §5.
Figure captions and axis labels in the experimental section should explicitly indicate which metric corresponds to which benchmark subset to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript accordingly to improve clarity and rigor.

Referee: §3 (Multi-Subject Data Generation Pipeline): The pipeline constructs training pairs by prompting single-subject models; the manuscript must include a quantitative audit (e.g., human preference scores or automated metrics on interaction fidelity, occlusion handling, and prompt adherence) comparing the synthetic pairs against real multi-subject photographs or artist-created references. Without this, the claim that the data is “diverse and high-quality” remains unverified and directly affects the reliability of the subsequent RL stage.

Authors: We agree that an explicit quantitative audit of the synthetic data would strengthen the paper. In the revised version we add a dedicated subsection with both human preference scores (n=200 raters) and automated metrics (CLIP-based interaction fidelity, occlusion detection via segmentation, and prompt adherence) comparing our generated pairs against a held-out set of real multi-subject photographs and artist references. These results confirm that the synthetic pairs achieve comparable or superior fidelity on the targeted dimensions.

revision: yes

Referee: §5 (Experiments and Benchmark): The abstract states that “extensive experiments demonstrate the effectiveness,” yet the provided description supplies no numerical results, baseline comparisons, or ablation tables. The manuscript should report concrete metrics (e.g., subject consistency scores, CLIP text alignment, FID) on the new benchmark’s seven subsets, together with statistical significance tests against at least two strong baselines (e.g., DreamBooth extended to multi-subject and a recent multi-subject method).

Authors: The full manuscript already contains numerical results, ablation tables, and comparisons on the seven benchmark subsets. To make these findings more prominent and directly responsive to the comment, we have added a consolidated results table reporting subject consistency, CLIP text alignment, and FID scores, together with paired t-test significance values against DreamBooth (multi-subject extension) and a recent multi-subject baseline. We also include the full per-subset breakdown.

revision: yes

Referee: §4.2 (Pairwise Subject-Consistency Rewards): The rewards are defined on generated images that themselves depend on the synthetic training distribution; if reward weights or filtering thresholds were tuned using the same evaluation sets used for final reporting, the reported gains risk circularity. The authors should state explicitly whether any hyper-parameters were selected on held-out data or via cross-validation and provide the exact reward formulation (including any learned components).

Authors: We confirm that all reward weights and filtering thresholds were selected exclusively on a held-out validation split that is disjoint from both the training data and the final test benchmark. The exact mathematical formulation of the Pairwise Subject-Consistency Rewards (including the learned component) is already given in Section 4.2; we have added an explicit paragraph in the revised manuscript stating the data-separation protocol and cross-validation procedure to eliminate any ambiguity regarding circularity.

revision: yes

Circularity Check

0 steps flagged

No circularity: derivation builds on external single-subject models with independent rewards and benchmark

The paper describes generating multi-subject training data via existing single-subject models, then applying newly designed Pairwise Subject-Consistency Rewards within an RL stage, followed by evaluation on a new benchmark with seven subsets. No equations, definitions, or steps in the provided text reduce the claimed performance gains to fitted parameters on the same data, self-citations that bear the central load, or ansatzes smuggled from prior author work. The pipeline and rewards are presented as additive refinements rather than tautological redefinitions of the inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that synthetic data from single-subject models transfers well and that the designed rewards are effective without extensive hyperparameter search details being provided in the abstract.

free parameters (1)

reward weights
Weights balancing pairwise consistency rewards against general-purpose rewards are likely tuned; exact values not stated in abstract.
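To make the role of these weights concrete, the sketch below combines a consistency reward with a general-purpose reward as a normalized weighted sum. Whether the paper combines its rewards this way is an assumption; the component names and the weight values are placeholders, since the page notes that the exact values are not stated.

```python
def combine_rewards(components, weights):
    """Weighted average of named reward components.

    `components` maps a reward name to its scalar value for one sample;
    `weights` maps the same names to non-negative weights (placeholders).
    """
    missing = set(components) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    total_weight = sum(weights[name] for name in components)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * value for name, value in components.items())
    return weighted / total_weight

# Hypothetical values: consistency is weighted twice as heavily as the
# general-purpose reward.
combined = combine_rewards({"psr": 0.8, "general": 0.6},
                           {"psr": 2.0, "general": 1.0})
```

Dividing by the total weight keeps the combined reward on the same scale as its components, which makes runs with different weight settings easier to compare.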

axioms (1)

domain assumption: Single-subject personalization models can generate sufficiently diverse and unbiased multi-subject training images.
Invoked when describing the scalable data generation pipeline.

pith-pipeline@v0.9.0 · 5500 in / 1094 out tokens · 70161 ms · 2026-05-17T03:46:11.544879+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data... Pairwise Subject-Consistency Rewards... GRPO training pipeline"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"Pairwise Subject-Consistency Reward (PSR)... $R_{\mathrm{PSR}} = \frac{1}{N} \sum_{i=1}^{N} f\left(I_i^{\mathrm{dec}}, I_i^{\mathrm{gt}}\right)$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV · 2026-05 · unverdicted · novelty 7.0

UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV · 2026-05 · unverdicted · novelty 6.0

A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 1 Pith paper · 18 internal anchors

[1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 4, 6, 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Xverse: Consistent multi-subject control of identity and semantic attributes via dit modulation.arXiv preprint arXiv:2506.21416, 2025

Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, and Xinglong Wu. Xverse: Consistent multi-subject control of identity and semantic attributes via dit modulation.arXiv preprint arXiv:2506.21416, 2025. 2, 3, 4, 6, 7, 1

work page arXiv 2025
[3]

Directly Fine-Tuning Diffusion Models on Differentiable Rewards

Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable re- wards.arXiv preprint arXiv:2309.17400, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, and Xihui Liu. Got-r1: Unleashing reasoning capability of mllm for vi- sual generation with reinforcement learning.arXiv preprint arXiv:2505.17022, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Optimizing ddpm sampling with shortcut fine-tuning.arXiv preprint arXiv:2301.13362, 2023

Ying Fan and Kangwook Lee. Optimizing ddpm sampling with shortcut fine-tuning.arXiv preprint arXiv:2301.13362, 2023

work page arXiv 2023
[6]

Dpok: Reinforcement learning for fine-tuning text-to-image diffu- sion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Moham- mad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffu- sion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023. 3

work page 2023
[7]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash- nik, Amit H Bermano, Gal Chechik, and Daniel Cohen- Or. An image is worth one word: Personalizing text-to- image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Pulid: Pure and lightning id customization via contrastive alignment.Advances in neural information pro- cessing systems, 37:36777–36804, 2024

Zinan Guo, Yanze Wu, Chen Zhuowei, Peng Zhang, Qian He, et al. Pulid: Pure and lightning id customization via contrastive alignment.Advances in neural information pro- cessing systems, 37:36777–36804, 2024. 2

work page 2024
[10]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 6, 1

work page 2022
[11]

Realcustom: Narrowing real text word for real-time open-domain text-to-image customiza- tion

Mengqi Huang, Zhendong Mao, Mingcong Liu, Qian He, and Yongdong Zhang. Realcustom: Narrowing real text word for real-time open-domain text-to-image customiza- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 7476–7485,

work page
[12]

Resolving multi-condition confusion for finetuning-free personalized image generation

Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng Yu, and Jie Song. Resolving multi-condition confusion for finetuning-free personalized image generation. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 3707–3714, 2025. 2

work page 2025
[13]

T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot

Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hong- sheng Li. T2i-r1: Reinforcing image generation with col- laborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025. 3

work page arXiv 2025
[14]

Multi-concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023. 2

work page 1931
[15]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dock- horn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Photomaker: Customizing re- alistic human photos via stacked id embedding

Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming- Ming Cheng, and Ying Shan. Photomaker: Customizing re- alistic human photos via stacked id embedding. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8640–8650, 2024. 2

work page 2024
[17]

Rich human feedback for text-to-image generation

Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, et al. Rich human feedback for text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19401–19411, 2024. 3

work page 2024
[18]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via on- line rl.arXiv preprint arXiv:2505.05470, 2025. 3, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuro- pean conference on computer vision, pages 38–55. Springer,

work page
[20]

Hpsv3: Towards wide-spectrum human preference score

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. arXiv preprint arXiv:2508.03789, 2025. 4, 6, 2

work page arXiv 2025
[21]

Realcustom++: Represent- ing images as real-word for real-time customization.arXiv preprint arXiv:2408.09744, 2024

Zhendong Mao, Mengqi Huang, Fei Ding, Mingcong Liu, Qian He, and Yongdong Zhang. Realcustom++: Represent- ing images as real-word for real-time customization.arXiv preprint arXiv:2408.09744, 2024. 2

work page arXiv 2024
[22]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2, 4, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022. 3

work page 2022
[24]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Ground- ing multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Aligning text-to-image diffusion models with reward backpropagation.arXiv preprint arXiv:2310.03739, 2023

Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation.arXiv preprint arXiv:2310.03739, 2023. 3

work page arXiv 2023
[26]

Video diffusion align- ment via reward gradients.arXiv preprint arXiv:2407.08737,

Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Kate- rina Fragkiadaki, and Deepak Pathak. Video diffusion align- ment via reward gradients.arXiv preprint arXiv:2407.08737,

work page arXiv
[27]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 4

work page 2021
[28]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500– 22510, 2023. 2, 4, 1

work page 2023
[29]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv preprint arXiv:1707.06347, 2017. 3

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

Objects365: A large-scale, high-quality dataset for object detection

Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 8430–8439, 2019. 4

work page 2019
[31]

Ominicontrol: Minimal and uni- versal control for diffusion transformer.arXiv preprint arXiv:2411.15098, 2024

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and uni- versal control for diffusion transformer.arXiv preprint arXiv:2411.15098, 2024. 2, 3, 5

work page arXiv 2024
[32]

Instantcharacter: Personalize any characters with a scalable diffusion transformer frame- work.arXiv preprint arXiv:2504.12395, 2025

Jiale Tao, Yanbing Zhang, Qixun Wang, Yiji Cheng, Hao- fan Wang, Xu Bai, Zhengguang Zhou, Ruihuang Li, Linqing Wang, Chunyu Wang, et al. Instantcharacter: Personalize any characters with a scalable diffusion transformer frame- work.arXiv preprint arXiv:2504.12395, 2025. 2

work page arXiv 2025
[33]

x-flux, 2025

XLabs AI team. x-flux, 2025. Accessed: 2025-02-07. 1, 4

work page 2025
[34]

InstantID: Zero-shot Identity-Preserving Generation in Seconds

Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024.

[35]

Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance

Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance. arXiv preprint arXiv:2406.07209, 2024.

[36]

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025.

[37]

Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation

Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15943–15953, 2023.

[38]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

[39]

OmniGen2: Exploration to Advanced Multimodal Generation

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025.

[40]

Less-to-more generalization: Unlocking more controllability by in-context generation

Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, and Qian He. Less-to-more generalization: Unlocking more controllability by in-context generation. arXiv preprint arXiv:2504.02160, 2025.

[41]

Omnigen: Unified image generation

Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025.

[42]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.

[43]

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059, 2024.

[44]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025.

[45]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[46]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721,
