PortraitGen: Exemplar-Driven GRPO with Dual-Reward Guidance for Photorealistic Portrait Generation

Chen Li; Huchuan Lu; Jing Lyu; Qian Liang; Xiaomin Li; Xu Jia; Yinan Li; Ying Zhang

arXiv: 2606.26930 · v1 · pith:RF5V6DHQ · submitted 2026-06-25 · 💻 cs.CV

Xiaomin Li, Qian Liang, Yinan Li, Ying Zhang, Chen Li, Jing Lyu, Huchuan Lu, Xu Jia

Pith reviewed 2026-06-26 05:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords portrait generation · GRPO · reinforcement learning · photorealism · AI artifacts · text-to-image · dual reward · image inversion

The pith

Inserting inverted real images into GRPO sampling groups plus dual rewards breaks the model's original distribution and removes fine-grained AI artifacts in portraits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that standard GRPO post-training stays trapped inside the model's starting distribution and therefore cannot fix subtle flaws such as oily skin or other biological implausibilities. By directly adding inverted real photographs to each GRPO group and scoring outputs with two new rewards—one for general quality and one for human-specific fidelity—the method steers generation toward photorealism. A reader should care because current text-to-image RL systems still produce visibly synthetic results even after aesthetic tuning, limiting practical use for portrait work. The authors also release a portrait-specific benchmark to measure these improvements.

Core claim

PortraitGen demonstrates that directly introducing real images into GRPO sampling groups via inversion, combined with an OmniReward for overall quality and an AI-Portrait reward for human-centric details, allows the policy to escape its original generative distribution and suppress AI artifacts that prior methods leave unresolved.

What carries the argument

Exemplar-driven GRPO that inserts inverted real images into sampling groups, guided by the dual-reward pair OmniReward and AI-Portrait.
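
A minimal Python sketch of how a real exemplar can change group-relative advantages, assuming the standard GRPO normalization (reward minus group mean, divided by group standard deviation). The weighted-sum `combined_reward`, its weights, and every reward value are hypothetical illustrations; the Figure 3 caption mentions a gating mechanism whose details are not given in this summary.

```python
import numpy as np


def combined_reward(omni, portrait, w_omni=0.5, w_portrait=0.5):
    """Hypothetical weighted sum of the two reward scores.

    Assumption: the paper's actual combination rule (it mentions gating)
    is not specified here, so a plain weighted sum stands in for it.
    """
    omni = np.asarray(omni, dtype=float)
    portrait = np.asarray(portrait, dtype=float)
    return w_omni * omni + w_portrait * portrait


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampling group."""
    rewards = np.asarray(rewards, dtype=float)
    # np.std defaults to the population standard deviation (ddof=0).
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# One group of G = 4: three generated samples plus one inverted real
# exemplar in the last slot. All scores below are made up for illustration.
omni_scores = [0.55, 0.60, 0.50, 0.90]
portrait_scores = [0.40, 0.45, 0.35, 0.95]

rewards = combined_reward(omni_scores, portrait_scores)
adv = group_relative_advantages(rewards)

for i, a in enumerate(adv):
    label = "exemplar" if i == len(adv) - 1 else f"sample {i}"
    print(f"{label}: advantage = {a:+.3f}")
```

In this toy group the exemplar scores highest, so it raises the group mean and every generated sample ends up with a negative advantage. That is one way an exemplar could alter the ranking signal, under the stated assumptions.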

If this is right

Generated portraits exhibit measurably fewer AI artifacts than those from standard GRPO or other baselines.
The method produces higher human-centric fidelity scores on the new PortraitBench benchmark.
Real-image exemplars can be reused across multiple GRPO iterations without retraining from scratch.
The dual-reward structure can be applied to other fine-grained image domains beyond portraits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same inversion-plus-dual-reward pattern might extend to non-portrait domains such as landscapes or product photography if analogous real-image exemplars and domain-specific rewards are supplied.
If the approach scales without extra hyperparameter search, it could shorten the iteration cycle between model release and production-quality output.
Future work could test whether the benefit persists when the number of real exemplars per group is reduced below the value used in the reported experiments.

Load-bearing premise

Directly adding inverted real images to GRPO groups together with the two new rewards will reliably push the model outside its starting distribution and remove artifacts without creating fresh failure modes.

What would settle it

Run the same portrait prompts on PortraitGen and on the prior GRPO baseline; if the rate of oily skin, unnatural eyes, or other listed artifacts remains statistically unchanged, the central claim does not hold.
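
One way to make "statistically unchanged" concrete is a two-proportion z-test on artifact rates. The sketch below uses only the Python standard library; the function name and the counts are hypothetical and are not results from the paper.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two artifact rates.

    hits_* are the number of images flagged with an artifact,
    n_* the number of images inspected for each method.
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both rates are 0 or both are 1: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only: 200 prompts per method.
z, p = two_proportion_z_test(hits_a=60, n_a=200, hits_b=30, n_b=200)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A large p-value on the real counts would indicate that the artifact rate has not measurably changed, which is the outcome that would undercut the central claim.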

Figures

Figures reproduced from arXiv: 2606.26930 by Chen Li, Huchuan Lu, Jing Lyu, Qian Liang, Xiaomin Li, Xu Jia, Yinan Li, Ying Zhang.

**Figure 1.** Training data samples for OmniReward and AI-Portrait Reward. “Instruction” denotes the user prompt, with <think> and <answer> tags enclosing the assistant’s response. The reasoning process in OmniReward is truncated here for brevity.

**Figure 2.** Quantitative scoring comparison between real and synthetic images using different reward models. Red dashed circles indicate severe structural distortions within the synthetic generation. Zoom in for best view.

**Figure 3.** Overview of Exemplar-Driven GRPO. The T2I model generates G − 1 images, which form a group alongside the exemplar image reconstructed via BELM inversion. Reward scores are then computed using OmniReward and AI-Portrait Reward. A gating mechanism is applied to selectively use these rewards or integrate new ones.

**Figure 4.** Distribution of our PortraitBench. Left: Categorical distribution of portrait scenarios. The benchmark encompasses diverse demographic groups, with percentages indicating their relative proportions. Right: Word cloud visualization of the benchmark.

**Figure 5.** Qualitative comparisons between PortraitGen and other methods. The icon denotes the superior generated image for each specific text prompt. Red dashed circles highlight obvious structural distortions in the generated limbs.

**Figure 7.** User study. Win rate comparisons between our method and baseline models across three evaluation dimensions.

Original abstract

Reinforcement Learning like Group Relative Policy Optimization (GRPO) has significantly advanced text-to-image post-training. However, current methods often favor superficial aesthetics, such as over-saturated colors, leaving critical flaws like AI artifacts and biological implausibilities unresolved. We attribute these limitations to two primary factors: (1) The absence of real images during post-training confines GRPO sampling to the original distribution, failing to break inherent generative boundaries; (2) the optimization process lacks specific rewards targeting fine-grained artifacts like overly oily skin and other AI artifacts. To address this, we propose PortraitGen, a novel framework tailored for photorealistic portrait generation. First, we break inherent generative boundaries by directly introducing real images into the GRPO sampling groups, where image inversion is employed to obtain their transition probabilities and latents. Second, to explicitly steer the model toward photorealism, we introduce a complementary dual-reward mechanism: OmniReward for general quality and AI-Portrait for human-centric fidelity. Furthermore, we curate PortraitBench, a comprehensive portrait-centric benchmark. Extensive experiments demonstrate that PortraitGen significantly outperforms existing baselines, effectively suppressing AI artifacts and achieving unprecedented photorealism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds real-image inversion into GRPO groups plus a dual-reward split for portraits, but the inversion step likely keeps updates inside the original distribution and the abstract gives no numbers to check the gains.

The core move here is inserting inverted real images directly into GRPO sampling groups so the policy sees real data during post-training, paired with a split reward (OmniReward for general quality and AI-Portrait for face-specific fidelity) and a new PortraitBench. That combination is not in the prior GRPO image papers cited in the abstract.

The approach is sensible on paper: current RL post-training drifts toward saturated outputs and leaves skin texture and biological errors untouched, and the authors correctly flag that sampling stays inside the pretrained support. Adding a human-centric reward is a straightforward way to target the narrow failure mode they care about.

The soft spot is the inversion claim. Standard DDIM-style inversion runs the model’s own noise predictor on the real image, so the resulting latents and transition probabilities are still conditioned on the original distribution. If that holds, the GRPO updates cannot reliably push the model outside its support, which undercuts the premise that this breaks generative boundaries. The abstract does not show the exact inversion procedure or any ablation that isolates this step, so it is impossible to tell whether the reported gains come from the real-image insertion or from the new reward terms alone.
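
For reference, the deterministic DDIM update and its commonly used inversion can be written as below. The notation is an assumption made here for illustration and is not taken from the paper: $\bar\alpha_t$ is the cumulative noise schedule and $\epsilon_\theta$ is the pretrained noise predictor.

```latex
% Assumed notation: \bar\alpha_t = cumulative noise schedule,
% \epsilon_\theta = pretrained noise predictor (deterministic DDIM, \eta = 0).
\begin{align}
\hat{x}_0(x_t) &= \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} \\
x_{t-1} &= \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0(x_t) + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t) && \text{(sampling)} \\
x_{t+1} &\approx \sqrt{\bar\alpha_{t+1}}\,\hat{x}_0(x_t) + \sqrt{1-\bar\alpha_{t+1}}\,\epsilon_\theta(x_t, t) && \text{(inversion)}
\end{align}
```

The inversion line reuses $\epsilon_\theta(x_t, t)$ as a stand-in for the prediction at step $t+1$, so every latent on the inverted trajectory is a function of the pretrained predictor. The Figure 3 caption indicates that the paper uses BELM inversion rather than plain DDIM, so whether the same concern carries over depends on details that the abstract does not give.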

No quantitative tables, training curves, or hyperparameter details appear in the provided text, which makes the “significantly outperforms” and “unprecedented photorealism” statements hard to evaluate. The benchmark itself is a useful addition for the portrait niche.

This is aimed at groups that already run RL fine-tuning on diffusion models and need a face-specific recipe. A reader working on artifact suppression in portraits would find the dual-reward idea and the benchmark worth looking at. The work shows clear engagement with the practical failure modes, so it is coherent on its own terms even if the central mechanism needs verification.

I would send it to peer review so the inversion math and the experimental controls can be checked properly.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces PortraitGen, a framework for photorealistic portrait generation that extends Group Relative Policy Optimization (GRPO) post-training. It identifies two limitations in prior work—sampling confined to the original model distribution and lack of rewards for fine-grained artifacts—and proposes to address them by (1) inserting real images into GRPO groups via image inversion to obtain latents and transition probabilities, and (2) adding a dual-reward mechanism (OmniReward for general quality and AI-Portrait for human-centric fidelity). The authors also curate PortraitBench and report that the method outperforms baselines in artifact suppression.

Significance. If the inversion step supplies statistics outside the pretrained support and the dual rewards demonstrably reduce specific artifacts without introducing new failure modes, the approach could provide a practical route to improving photorealism in RL-tuned generative models, especially for human portraits where biological implausibilities are costly.

major comments (1)

[Abstract] The claim that 'directly introducing real images into the GRPO sampling groups, where image inversion is employed to obtain their transition probabilities and latents' breaks inherent generative boundaries assumes these probabilities are not still conditioned on the pretrained distribution. Standard inversion (DDIM or equivalent) derives latents and transition probabilities by running the model's own forward process or noise predictor on the real image; if this holds, the GRPO updates remain inside the original support and cannot reliably eliminate fine-grained artifacts as asserted.

minor comments (1)

The abstract would be strengthened by naming the exact inversion procedure and any modifications to the GRPO group construction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful analysis of our abstract claim. We address the single major comment below and will make corresponding revisions.

Referee: [Abstract] The claim that 'directly introducing real images into the GRPO sampling groups, where image inversion is employed to obtain their transition probabilities and latents' breaks inherent generative boundaries assumes these probabilities are not still conditioned on the pretrained distribution. Standard inversion (DDIM or equivalent) derives latents and transition probabilities by running the model's own forward process or noise predictor on the real image; if this holds, the GRPO updates remain inside the original support and cannot reliably eliminate fine-grained artifacts as asserted.

Authors: We agree that standard DDIM-style inversion computes latents and transition probabilities using the pretrained model's noise predictor, so the resulting latents remain within the original support. The manuscript's phrasing that this 'breaks inherent generative boundaries' is therefore imprecise. The intended mechanism is that real-image latents are mixed into each GRPO group; the dual rewards then produce relative rankings that include these real exemplars, allowing the policy gradient to shift generation toward photorealistic outputs even though each individual sample is still drawn from the model. We will revise the abstract (and the corresponding methods paragraph) to remove the 'breaks boundaries' claim, replace it with a clearer description of exemplar mixing and reward-driven ranking, and add a short discussion of the support limitation.

Revision: yes

Circularity Check

0 steps flagged

No circularity detected; claims rest on empirical assumptions rather than self-referential derivations

The paper proposes PortraitGen by inserting inverted real images into GRPO groups and adding OmniReward plus AI-Portrait rewards to address AI artifacts. No equations appear that define a quantity in terms of itself or rename a fitted parameter as a prediction. The inversion step is presented as a methodological choice to supply external statistics, not as a derivation that reduces to the pretrained model by construction. No self-citation chains, uniqueness theorems, or ansatzes smuggled via prior work are invoked in the abstract or described claims. The central argument is therefore an independent proposal whose validity is left to experimental validation rather than tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5758 in / 997 out tokens · 18108 ms · 2026-06-26T05:12:27.539338+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 32 canonical work pages · 19 internal anchors

[1] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
[2] Bie, F., Yang, Y., Zhou, Z., Ghanem, A., Zhang, M., Yao, Z., Wu, X., Holmes, C., Golnari, P., Clifton, D.A., et al.: Renaissance: A survey into ai text-to-image generation in the era of large model. IEEE transactions on pattern analysis and machine intelligence 47(3), 2212–2231 (2024)
[3] Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301 (2023)
[4] Cai, H., Cao, S., Du, R., Gao, P., Hoi, S., Hou, Z., Huang, S., Jiang, D., Jin, X., Li, L., et al.: Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699 (2025)
[5] Clark, K., Vicol, P., Swersky, K., Fleet, D.J.: Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400 (2023)
[6] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)
[7] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)
[8] Fan, Y., Watkins, O., Du, Y., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., Lee, K.: Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems 36, 79858–79885 (2023)
[9] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)
[10] Hartwig, S., Engel, D., Sick, L., Kniesel, H., Payer, T., Poonam, P., Glöckler, M., Bäuerle, A., Ropinski, T.: A survey on quality metrics for text-to-image generation. IEEE Transactions on Visualization and Computer Graphics 31(10), 9464–9483 (2025). https://doi.org/10.1109/TVCG.2025.3585077
[11] He, J., Geng, Y., Bo, L.: Uniportrait: A unified framework for identity-preserving single- and multi-human image personalization (2024), https://arxiv.org/abs/2408.05939
[12] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)
[13] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
[14] Jiang, D., Guo, Z., Zhang, R., Zong, Z., Li, H., Zhuo, L., Yan, S., Heng, P.A., Li, H.: T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot. arXiv preprint arXiv:2505.00703 (2025)
[15] Kaufmann, T., Weng, P., Bengs, V., Hüllermeier, E.: A survey of reinforcement learning from human feedback (2025), https://arxiv.org/abs/2312.14925
[16] Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems 36, 36652–36663 (2023)
[17] Ku, M., Jiang, D., Wei, C., Yue, X., Chen, W.: Viescore: Towards explainable metrics for conditional image synthesis evaluation. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12268–12290 (2024)
[18] Labs, B.F.: Flux. https://github.com/black-forest-labs/flux (2024)
[19] Labs, B.F.: FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2 (2025)
[20] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)
[21] Li, X., Liu, Y., Isobe, T., Jia, X., Cui, Q., Zhou, D., Li, D., He, Y., Lu, H., Wang, Z., Barsoum, E.: Reneg: Learning negative embedding with reward guidance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 23636–23645 (June 2025)
[22] Li, Y., Liu, X., Kag, A., Hu, J., Idelbayev, Y., Sagar, D., Wang, Y., Tulyakov, S., Ren, J.: Textcraftor: Your text encoder can be image quality controller. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7985–7995 (June 2024)
[23] Li, Z., Cao, M., Wang, X., Qi, Z., Cheng, M.M., Shan, Y.: Photomaker: Customizing realistic human photos via stacked id embedding (2023), https://arxiv.org/abs/2312.04461
[24] Liang, Q., Wu, Y., Li, K., Wei, J., He, S., Guo, J., Xie, N.: Mm-r1: Unleashing the power of unified multimodal large language models for personalized image generation. arXiv preprint arXiv:2508.11433 (2025)
[25] Liao, Z., Liu, X., Qin, W., Li, Q., Wang, Q., Wan, P., Zhang, D., Zeng, L., Feng, P.: Humanaesexpert: Advancing a multi-modality foundation model for human image aesthetic assessment. arXiv preprint arXiv:2503.23907 (2025)
[26] Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., Ouyang, W.: Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470 (2025)
[27] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in neural information processing systems 35, 5775–5787 (2022)
[28] Ma, Y., Wu, X., Sun, K., Li, H.: Hpsv3: Towards wide-spectrum human preference score, 2025. URL https://arxiv.org/abs/2508.03789
[29] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
[30] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
[31] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems 35, 25278–25294 (2022)
[32] Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)
[33] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
[34] Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025)
[35] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
[36] Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)
[37] Wallace, B., Dang, M., Rafailov, R., Zhou, L., Lou, A., Purushwalkam, S., Ermon, S., Xiong, C., Joty, S., Naik, N.: Diffusion model alignment using direct preference optimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8228–8238 (2024)
[38] Wang, F., Yin, H., Dong, Y.J., Zhu, H., Zhao, H., Qian, H., Li, C., et al.: Belm: Bidirectional explicit linear multi-step sampler for exact inversion in diffusion models. Advances in Neural Information Processing Systems 37, 46118–46159 (2024)
[39] Wang, Y., Li, Z., Zang, Y., Zhou, Y., Bu, J., Wang, C., Lu, Q., Jin, C., Wang, J.: Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751 (2025)
[40] Wang, Y., Zang, Y., Li, H., Jin, C., Wang, J.: Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236 (2025)
[41] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...
[42] Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., Li, H.: Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341 (2023)
[43] Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025)
[44] Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528 (2024)
[45] Xiong, T., Wang, X., Guo, D., Ye, Q., Fan, H., Gu, Q., Huang, H., Li, C.: Llava-critic: Learning to evaluate multimodal models (2025), https://arxiv.org/abs/2410.02712
[46] Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems 36, 15903–15935 (2023)
[47] Xue, Z., Wu, J., Gao, Y., Kong, F., Zhu, L., Chen, M., Liu, Z., Liu, W., Guo, Q., Huang, W., et al.: Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818 (2025)
[48] Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., Yang, M.H.: Diffusion models: A comprehensive survey of methods and applications. ACM computing surveys 56(4), 1–39 (2023)
[49] Yang, P., Cheung, N.M., Ma, X.: Text to image generation and editing: A survey. arXiv preprint arXiv:2505.02527 (2025)
[50] Ye, J., Zhu, L., Guo, Y., Jiang, D., Huang, Z., Zhang, Y., Yan, Z., Fu, H., He, C., Li, W.: Realgen: Photorealistic text-to-image generation via detector-guided rewards. arXiv preprint arXiv:2512.00473 (2025)
[51] Yu, R., Wan, S., Wang, Y., Gao, C.X., Gan, L., Zhang, Z., Zhan, D.C.: Reward models in deep reinforcement learning: A survey (2025), https://arxiv.org/abs/2506.15421
[52] Zhang, C., Zhang, C., Zhang, M., Kweon, I.S., Kim, J.: Text-to-image diffusion models in generative ai: A survey. arXiv preprint arXiv:2303.07909 (2023)

[1] [1]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025) 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

IEEE transactions on pattern analysis and machine intelligence47(3), 2212–2231 (2024) 3

Bie, F., Yang, Y., Zhou, Z., Ghanem, A., Zhang, M., Yao, Z., Wu, X., Holmes, C., Golnari, P., Clifton, D.A., et al.: Renaissance: A survey into ai text-to-image generation in the era of large model. IEEE transactions on pattern analysis and machine intelligence47(3), 2212–2231 (2024) 3

2024

[3] [3]

Training Diffusion Models with Reinforcement Learning

Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301 (2023) 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4]

Cai, H., Cao, S., Du, R., Gao, P., Hoi, S., Hou, Z., Huang, S., Jiang, D., Jin, X., Li, L., et al.: Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699 (2025)

[5]

Clark, K., Vicol, P., Swersky, K., Fleet, D.J.: Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400 (2023)

[6]

Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

[7]

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)

[8]

Fan, Y., Watkins, O., Du, Y., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., Lee, K.: DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems 36, 79858–79885 (2023)

[9]

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

[10]

Hartwig, S., Engel, D., Sick, L., Kniesel, H., Payer, T., Poonam, P., Glöckler, M., Bäuerle, A., Ropinski, T.: A survey on quality metrics for text-to-image generation. IEEE Transactions on Visualization and Computer Graphics 31(10), 9464–9483 (2025). https://doi.org/10.1109/TVCG.2025.3585077

[11]

He, J., Geng, Y., Bo, L.: UniPortrait: A unified framework for identity-preserving single- and multi-human image personalization (2024), https://arxiv.org/abs/2408.05939

[12]

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)

[13]

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)

[14]

Jiang, D., Guo, Z., Zhang, R., Zong, Z., Li, H., Zhuo, L., Yan, S., Heng, P.A., Li, H.: T2I-R1: Reinforcing image generation with collaborative semantic-level and token-level CoT. arXiv preprint arXiv:2505.00703 (2025)

[15]

Kaufmann, T., Weng, P., Bengs, V., Hüllermeier, E.: A survey of reinforcement learning from human feedback (2025), https://arxiv.org/abs/2312.14925

[16]

Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems 36, 36652–36663 (2023)

[17]

Ku, M., Jiang, D., Wei, C., Yue, X., Chen, W.: VIEScore: Towards explainable metrics for conditional image synthesis evaluation. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12268–12290 (2024)

[18]

Labs, B.F.: Flux. https://github.com/black-forest-labs/flux (2024)

[19]

Labs, B.F.: FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2 (2025)

[20]

Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning. pp. 19730–19742. PMLR (2023)

[21]

Li, X., Liu, Y., Isobe, T., Jia, X., Cui, Q., Zhou, D., Li, D., He, Y., Lu, H., Wang, Z., Barsoum, E.: ReNeg: Learning negative embedding with reward guidance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 23636–23645 (June 2025)

[22]

Li, Y., Liu, X., Kag, A., Hu, J., Idelbayev, Y., Sagar, D., Wang, Y., Tulyakov, S., Ren, J.: TextCraftor: Your text encoder can be image quality controller. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7985–7995 (June 2024)

[23]

Li, Z., Cao, M., Wang, X., Qi, Z., Cheng, M.M., Shan, Y.: PhotoMaker: Customizing realistic human photos via stacked ID embedding (2023), https://arxiv.org/abs/2312.04461

[24]

Liang, Q., Wu, Y., Li, K., Wei, J., He, S., Guo, J., Xie, N.: MM-R1: Unleashing the power of unified multimodal large language models for personalized image generation. arXiv preprint arXiv:2508.11433 (2025)

[25]

Liao, Z., Liu, X., Qin, W., Li, Q., Wang, Q., Wan, P., Zhang, D., Zeng, L., Feng, P.: HumanAesExpert: Advancing a multi-modality foundation model for human image aesthetic assessment. arXiv preprint arXiv:2503.23907 (2025)

[26]

Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., Ouyang, W.: Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470 (2025)

[27]

Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems 35, 5775–5787 (2022)

[28]

Ma, Y., Wu, X., Sun, K., Li, H.: HPSv3: Towards wide-spectrum human preference score (2025), https://arxiv.org/abs/2508.03789

[29]

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

[30]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)

[31]

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022)

[32]

Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)

[33]

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

[34]

Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025)

[35]

Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

[36]

Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)

[37]

Wallace, B., Dang, M., Rafailov, R., Zhou, L., Lou, A., Purushwalkam, S., Ermon, S., Xiong, C., Joty, S., Naik, N.: Diffusion model alignment using direct preference optimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8228–8238 (2024)

[38]

Wang, F., Yin, H., Dong, Y.J., Zhu, H., Zhao, H., Qian, H., Li, C., et al.: BELM: Bidirectional explicit linear multi-step sampler for exact inversion in diffusion models. Advances in Neural Information Processing Systems 37, 46118–46159 (2024)

[39]

Wang, Y., Li, Z., Zang, Y., Zhou, Y., Bu, J., Wang, C., Lu, Q., Jin, C., Wang, J.: Pref-GRPO: Pairwise preference reward-based GRPO for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751 (2025)

[40]

Wang, Y., Zang, Y., Li, H., Jin, C., Wang, J.: Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236 (2025)

[41]

Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...

[42]

Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., Li, H.: Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341 (2023)

[43]

Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: OmniGen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025)

[44]

Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528 (2024)

[45]

Xiong, T., Wang, X., Guo, D., Ye, Q., Fan, H., Gu, Q., Huang, H., Li, C.: LLaVA-Critic: Learning to evaluate multimodal models (2025), https://arxiv.org/abs/2410.02712

[46]

Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems 36, 15903–15935 (2023)

[47]

Xue, Z., Wu, J., Gao, Y., Kong, F., Zhu, L., Chen, M., Liu, Z., Liu, W., Guo, Q., Huang, W., et al.: DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818 (2025)
