Embedding-perturbed Exploration Preference Optimization for Flow Models

Chubin Chen; Jiahong Wu; Jiashu Zhu; Sujie Hu; Xiangxiang Chu; Xiu Li

arxiv: 2605.15803 · v1 · pith:UGXFDZYJnew · submitted 2026-05-15 · 💻 cs.CV · cs.LG

Sujie Hu, Chubin Chen, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu, Xiu Li

Pith reviewed 2026-05-20 18:28 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords preference optimization, flow models, embedding perturbation, reinforcement learning, variance maintenance, human alignment, generative models, exploration

The pith

Embedding-level perturbations within sample groups sustain variance and keep the learning signal alive during preference optimization for flow models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that group-based reinforcement learning methods for aligning generative models quickly lose intra-group differences, causing variance to collapse and removing the signal needed for stable optimization. To fix this, it introduces structured perturbations applied directly at the embedding stage inside each group of samples. These perturbations are designed to maintain useful variance without destroying the semantic meaning of the samples or the correctness of their preference labels. If successful, the approach allows continued learning throughout training instead of early stagnation or reward hacking. Experiments indicate the resulting flow models align more closely with human preferences than prior techniques.
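To make the variance-collapse failure mode concrete, here is a minimal sketch of the group-normalized advantage commonly used in GRPO-style methods, A_i = (r_i - mean) / (std + eps). The reward values, the function name `group_advantages`, and the `eps` constant are illustrative assumptions and are not taken from the paper. When every sample in a group earns the same reward, every advantage is zero and the group contributes no learning signal.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r_i - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reward values, chosen only for illustration.
diverse_group = [0.2, 0.5, 0.8, 0.9]    # samples still differ in reward
collapsed_group = [0.7, 0.7, 0.7, 0.7]  # intra-group variance has vanished

adv_diverse = group_advantages(diverse_group)
adv_collapsed = group_advantages(collapsed_group)

print("diverse:  ", [round(a, 3) for a in adv_diverse])
print("collapsed:", [round(a, 3) for a in adv_collapsed])
```

In the collapsed case the numerator is zero for every sample, so no amount of rescaling recovers a gradient direction; this is the situation the paper's perturbations are meant to prevent.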

Core claim

Embedding-perturbed Exploration Preference Optimization (E²PO) adds structured perturbations at the embedding level inside sample groups. This produces a sustained intra-group variance that preserves the discriminative signal required for optimization. The framework therefore avoids the variance collapse that occurs in standard group-based methods and yields flow models whose outputs match human preferences more faithfully than existing baselines.

What carries the argument

Embedding-level perturbation inside sample groups: a controlled addition of structured noise at the embedding stage that maintains variance while leaving semantic content and preference ordering unchanged.
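As a rough illustration of what an embedding-level perturbation could look like, the sketch below creates several noisy copies of one prompt embedding, with each perturbation rescaled to a fixed fraction of the embedding's norm. This is a stand-in only: the page does not specify how the paper's structured perturbation is built, so the isotropic Gaussian noise, the function name `perturb_embedding`, and the `rel_scale` value are all assumptions.

```python
import math
import random


def perturb_embedding(embedding, num_variants, rel_scale=0.05, seed=0):
    """Return noisy copies of `embedding`.

    Each perturbation is Gaussian noise rescaled so that its L2 norm
    equals rel_scale * ||embedding|| (an assumed, illustrative choice).
    """
    rng = random.Random(seed)
    base_norm = math.sqrt(sum(x * x for x in embedding))
    variants = []
    for _ in range(num_variants):
        noise = [rng.gauss(0.0, 1.0) for _ in embedding]
        noise_norm = math.sqrt(sum(n * n for n in noise)) or 1.0
        scale = rel_scale * base_norm / noise_norm
        variants.append([x + scale * n for x, n in zip(embedding, noise)])
    return variants


# Toy 4-dimensional embedding; real text-encoder embeddings are far larger.
prompt_embedding = [0.3, -1.2, 0.7, 2.0]
variants = perturb_embedding(prompt_embedding, num_variants=3)
```

Bounding the perturbation relative to the embedding norm keeps each variant close in direction to the original, which is one simple way to limit semantic drift; whether that is enough to leave preference labels intact is exactly the premise the review questions.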

If this is right

Optimization remains stable because a non-zero discriminative signal persists even late in training.
Flow models reach higher human-preference alignment without requiring larger group sizes or repeated noise resampling.
The risk of premature policy stagnation or reward hacking is reduced.
The same perturbation principle offers a direct alternative to variance-increasing tricks that have shown diminishing returns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same embedding perturbation idea could be tested on diffusion or autoregressive models to check whether variance maintenance is architecture-specific.
Smaller group sizes might become viable if the perturbation reliably supplies the missing signal, lowering per-step compute.
Measuring output diversity on downstream tasks after training would reveal whether the added variance also improves sample variety.

Load-bearing premise

Perturbations added at the embedding level will increase useful variance without corrupting the semantic validity of the samples or the accuracy of their preference labels.

What would settle it

An experiment that applies the embedding perturbations and then finds either collapsed variance across groups or generated samples whose human preference rankings differ from the unperturbed versions would falsify the central claim.
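One way such a check might be scored is sketched below: compare preference scores for the same samples with and without perturbation and report the fraction of sample pairs whose ordering is unchanged. The function name `pairwise_agreement` and the score values are hypothetical; the scores could come from human raters or a proxy preference model.

```python
from itertools import combinations


def pairwise_agreement(scores_ref, scores_pert):
    """Fraction of sample pairs ranked the same way in both score lists."""
    if len(scores_ref) != len(scores_pert) or len(scores_ref) < 2:
        raise ValueError("need two equally sized lists with at least 2 scores")

    def sign(x):
        return (x > 0) - (x < 0)

    pairs = list(combinations(range(len(scores_ref)), 2))
    same = sum(
        sign(scores_ref[i] - scores_ref[j]) == sign(scores_pert[i] - scores_pert[j])
        for i, j in pairs
    )
    return same / len(pairs)


# Hypothetical preference scores for four samples.
unperturbed = [0.91, 0.40, 0.75, 0.10]
perturbed = [0.88, 0.45, 0.79, 0.05]
agreement = pairwise_agreement(unperturbed, perturbed)
print(f"pairwise ranking agreement: {agreement:.2f}")
```

An agreement well below 1.0 at the perturbation strength used in training would point toward the label-corruption scenario described above.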

Figures

Figures reproduced from arXiv: 2605.15803 by Chubin Chen, Jiahong Wu, Jiashu Zhu, Sujie Hu, Xiangxiang Chu, Xiu Li.

**Figure 1.** (Top) Unlike baselines that suffer from Vanishing Discriminative Signal, E²PO sustains Discriminative Variance through embedding-level perturbation. (Bottom) By injecting structured perturbations into the semantic manifold (Center), our method significantly expands the exploration space (Right) and achieves superior reward alignment (Left) compared to state-of-the-art methods.

**Figure 2.** Evolution of Intra-Group Discriminative Variance During Training. Smoothed curves track the changing trends of variance statistics as training proceeds. The baseline’s standard deviation declines significantly, whereas E²PO maintains a consistent variance level, demonstrating sufficient intra-group discriminative variance throughout the optimization process.

**Figure 3.** Method Overview. We introduce (a) selective perturbation on content embeddings to induce discriminative signal, (b) a noise-aware schedule to modulate condition injection during sampling, and (c) a reference-anchored strategy that calculates gradients relative to the original prompt C_orig to prevent semantic drift.

**Figure 4.** Qualitative Comparison of E²PO against Baselines. E²PO demonstrates superior performance in fidelity, spatial reasoning, instruction adherence, and diversity, overcoming the limitations (highlighted in red circles) seen in other methods.

**Figure 5.** Ablation of Latent Group Size G and Number of Semantic Variants K. We analyze the trade-off between G and K under a fixed computational budget (N = G × K). We observe that the extreme configurations (G = 1 or K = 1) are insufficient, whereas a balanced split between G and K achieves the most stable and highest performance.

**Figure 6.** Ablation of Noise-Aware Sampling Schedule. Samples are generated at the 150th training step. The static strategy (right) leads to semantic drift or artifacts, whereas our Noise-Aware Schedule (left) maintains high visual fidelity.

**Figure 8.** Human Preference Evaluation Results. We compare E²PO against SD3.5-M and RL-based baselines (DanceGRPO, Flow-GRPO, DiffusionNFT) across four key dimensions: Detail Preservation, Color Consistency, Image-Text Alignment, and Overall Quality. The results demonstrate that our method consistently achieves the highest user preference rates across all categories.

**Figure 9.** Ablation of Perturbation Scope. We compare perturbing the primary encoder (CLIP-L) against perturbing both encoders. The plot shows that limiting perturbation to CLIP-L ensures high-quality generation, while the dual-encoder strategy suffers from significant performance degradation and instability.

**Figure 10.** Qualitative Comparison of E²PO against SOTAs on the GenEval Benchmark. All RL-based methods are trained using GenEval as the reward model.

**Figure 11.** Qualitative Comparison of E²PO against SOTAs on the PickScore Benchmark. All RL-based methods are trained using PickScore as the reward model.
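The fixed-budget constraint in the Figure 5 caption, N = G × K, can be illustrated by listing every way a budget splits into a latent group size G and a number of semantic variants K. The budget of 12 and the helper name `budget_splits` are assumptions for illustration; the page does not report which values the paper actually tested.

```python
def budget_splits(n):
    """All (G, K) pairs of positive integers with G * K == n."""
    return [(g, n // g) for g in range(1, n + 1) if n % g == 0]


splits = budget_splits(12)
# Drop the extreme configurations (G = 1 or K = 1) that the caption
# describes as insufficient.
balanced = [(g, k) for g, k in splits if g > 1 and k > 1]

print(splits)    # [(1, 12), (2, 6), (3, 4), (4, 3), (6, 2), (12, 1)]
print(balanced)  # [(2, 6), (3, 4), (4, 3), (6, 2)]
```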

Original abstract

Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization (E²PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Using embedding perturbations to sustain variance in group RL for flow-model alignment is a practical idea, but the abstract supplies no math and no check on whether preference labels stay valid after the shift.

The main thing here is that the authors flag variance collapse inside groups as a real blocker for GRPO-style preference tuning on flow models and suggest fixing it with structured embedding perturbations. That is the core of E²PO. They argue existing fixes like bigger groups or different noise do not solve the root issue, and their method keeps the discriminative signal alive longer while claiming better human alignment in experiments. This is a reasonable engineering angle if it holds up.

What stands out is the choice to perturb at the embedding level rather than at the input or group-size level; that specific mechanism is not in the references they cite. The paper does a clear job naming the instability that shows up when scaling these alignment runs.

The soft spots are the missing pieces. No equations or bounds appear in the abstract for how the perturbations are constructed or why they commute with the preference oracle. The stress-test concern is on point: if an embedding shift moves a sample across a latent decision boundary that flips the human preference, the reward signal turns noisy and the optimization is no longer a faithful surrogate. Without an invariance argument or at least label-consistency metrics in the experiments, it is hard to know the variance gain is clean. The outperformance claim is also hard to judge since no numbers or baselines are shown.

This paper is aimed at people who train flow models with human feedback and have already hit the variance wall in group RL. A reader who has tried GRPO variants would see the practical motivation right away. It deserves peer review if the full version includes the perturbation details, ablations, and some check on label stability; otherwise it stays too preliminary to engage deeply.

Referee Report

1 major / 1 minor

Summary. The paper proposes Embedding-perturbed Exploration Preference Optimization (E²PO) for flow models. It identifies rapid decay of intra-group variance as a core limitation in group-based RL frameworks such as GRPO, which eliminates the learning signal and leads to instability or reward hacking. The method introduces structured perturbations at the embedding level within sample groups to sustain variance while preserving the discriminative signal and semantic validity. Experiments are claimed to show significant outperformance over state-of-the-art baselines with more faithful human-preference alignment.

Significance. If the embedding perturbations can be shown to increase useful variance without corrupting preference labels, the approach would offer a targeted engineering fix for a known instability in preference optimization of generative models. This could improve training stability for flow-based architectures without requiring larger groups or noise variation, provided the invariance property holds.

major comments (1)

[Abstract and §3 (Method)] The central claim that 'structured, embedding-level perturbations ... guaranteeing a robust variance that preserves the discriminative signal' requires that the perturbation operator leaves both semantic content and the correctness of human preference labels unchanged. No derivation, bound, or invariance argument is supplied showing that the perturbation commutes with the preference oracle or that embedding shifts do not cross decision boundaries corresponding to preference flips. This is load-bearing for the claim, as label corruption would turn the optimization objective into a misaligned surrogate.

minor comments (1)

[Abstract] The abstract supplies no equations, implementation details, or quantitative metrics, which hinders immediate technical assessment even though this is acceptable for an abstract.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and detailed review of our manuscript. The major comment raises an important point about the need for justification that our embedding perturbations preserve semantic content and preference labels. We address this below and outline the changes we will make in revision.

Referee: [Abstract and §3 (Method)] The central claim that 'structured, embedding-level perturbations ... guaranteeing a robust variance that preserves the discriminative signal' requires that the perturbation operator leaves both semantic content and the correctness of human preference labels unchanged. No derivation, bound, or invariance argument is supplied showing that the perturbation commutes with the preference oracle or that embedding shifts do not cross decision boundaries corresponding to preference flips. This is load-bearing for the claim, as label corruption would turn the optimization objective into a misaligned surrogate.

Authors: We agree that establishing preservation of semantic content and preference labels is central to the validity of E²PO. The current manuscript motivates the perturbations as small and structured within the embedding space of a fixed pre-trained encoder, with the claim supported by downstream empirical results showing improved alignment and stability. However, we acknowledge the absence of an explicit invariance argument or bound. In the revised manuscript we will add a new subsection in §3 that (i) provides a heuristic argument based on the local Lipschitz continuity of the embedding map and the small magnitude of the perturbations, (ii) reports an empirical label-consistency study in which human raters or a proxy preference model re-evaluate perturbed versus unperturbed pairs, and (iii) discusses the operating regime in which decision-boundary crossings are unlikely. These additions will make the load-bearing assumption explicit and testable.

revision: yes
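For orientation, one shape such a heuristic argument could take is sketched below. It is an illustration written for this page, not the authors' derivation: it assumes a hypothetical preference score s defined on embeddings that is L-Lipschitz over the region containing all perturbed embeddings, and a perturbation budget ε bounding every perturbation norm. The symbols s, L, ε, e_a, e_b, δ_a, and δ_b are introduced here and do not appear in the source.

```latex
% Assumptions (hypothetical): s is L-Lipschitz on the perturbed region,
% and every perturbation satisfies \|\delta\| \le \epsilon.
\begin{align*}
|s(e + \delta) - s(e)| &\le L\,\|\delta\| \le L\epsilon, \\
s(e_a + \delta_a) - s(e_b + \delta_b)
  &\ge \bigl(s(e_a) - s(e_b)\bigr) - 2L\epsilon .
\end{align*}
% Hence the ordering s(e_a) > s(e_b) survives perturbation whenever
% \epsilon < \frac{s(e_a) - s(e_b)}{2L}.
```

Under these assumptions, only pairs whose unperturbed score margin is smaller than 2Lε can flip, which is the "operating regime" question raised in point (iii) of the response.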

Circularity Check

0 steps flagged

No significant circularity; method presented as independent engineering intervention without reductive derivations

full rationale

The abstract and available text introduce E²PO as a novel framework that adds structured embedding-level perturbations to sustain intra-group variance in group-based RL optimization for flow models. No equations, derivations, or parameter-fitting steps are shown that reduce a claimed prediction or result back to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The variance-preservation benefit is asserted directly from the perturbation design rather than derived from fitted quantities or prior self-referential results. This is a standard non-circular engineering proposal; the derivation chain (if any exists in the full manuscript) does not exhibit the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested assumption that embedding perturbations preserve semantic validity and on at least one tunable perturbation strength hyperparameter whose value is not reported.

free parameters (1)

perturbation magnitude
Controls how strongly embeddings are altered; must be chosen or fitted to keep variance high without harming sample quality.

axioms (1)

domain assumption: Structured embedding perturbations increase intra-group variance while leaving preference signals intact.
This premise is required for the method to deliver a usable learning signal rather than noise or bias.

pith-pipeline@v0.9.0 · 5717 in / 1145 out tokens · 50153 ms · 2026-05-20T18:28:19.375871+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
Paper passage: "we propose Embedding-perturbed Exploration Preference Optimization (E²PO)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation
cs.CV · 2026-05 · unverdicted · novelty 5.0

EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.

Reference graph

Works this paper leans on

101 extracted references · 101 canonical work pages · cited by 1 Pith paper · 19 internal anchors

[1]

https://github.com/ discus0434/aesthetic-predictor-v2-5 ,

Aesthetic predictor v2.5. https://github.com/ discus0434/aesthetic-predictor-v2-5 ,

work page
[2]

Accessed: 2025-06-10

work page 2025
[3]

Ban, Y ., Wang, R., Zhou, T., Cheng, M., Gong, B., and Hsieh, C.-J. Understanding the impact of negative prompts: When and how do they take effect? In 8 E²PO: Embedding-perturbed Exploration Preference Optimization for Flow Models european conference on computer vision, pp. 190–206. Springer, 2024

work page 2024
[4]

Representa- tion learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelli- gence, 35(8):1798–1828, 2013

Bengio, Y ., Courville, A., and Vincent, P. Representa- tion learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelli- gence, 35(8):1798–1828, 2013

work page 2013
[5]

Training diffusion models with reinforcement learn- ing

Black, K., Janner, M., Du, Y ., Kostrikov, I., and Levine, S. Training diffusion models with reinforcement learn- ing. InThe Twelfth International Conference on Learn- ing Representations

work page
[6]

Taming preference mode collapse via directional decoupling alignment in diffusion reinforcement learning.arXiv preprint arXiv:2512.24146, 2025

Chen, C., Hu, S., Zhu, J., Wu, M., Chen, J., Li, Y ., Huang, N., Fang, C., Wu, J., Chu, X., et al. Taming preference mode collapse via directional decoupling alignment in diffusion reinforcement learning.arXiv preprint arXiv:2512.24146, 2025

work page arXiv 2025
[7]

Stochastic self- guidance for training-free enhancement of diffusion models

Chen, C., Zhu, J., Feng, X., Huang, N., Zhu, C., Wu, M., Mao, F., Wu, J., Chu, X., and Li, X. Stochastic self- guidance for training-free enhancement of diffusion models. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[8]

T2i- copilot: A training-free multi-agent text-to-image sys- tem for enhanced prompt interpretation and interactive generation

Chen, C.-Y ., Shi, M., Zhang, G., and Shi, H. T2i- copilot: A training-free multi-agent text-to-image sys- tem for enhanced prompt interpretation and interactive generation. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision (ICCV), pp. 19396–19405, October 2025

work page 2025
[9]

Unveiling chain of step reasoning for vision-language models with fine-grained rewards.Advances in Neural Information Processing Systems, 38:114703–114727, 2026

Chen, H., Lou, X., Feng, X., Huang, K., and Wang, X. Unveiling chain of step reasoning for vision-language models with fine-grained rewards.Advances in Neural Information Processing Systems, 38:114703–114727, 2026

work page 2026
[10]

Conceptweaver: Weav- ing disentangled concepts with flow.arXiv preprint arXiv:2603.28493, 2026

Chen, J., Hao, A., Chen, X., Bai, C., Chen, C., Li, Y ., Wu, J., Chu, X., and Zhang, S. Conceptweaver: Weav- ing disentangled concepts with flow.arXiv preprint arXiv:2603.28493, 2026

work page arXiv 2026
[11]

Contextflow: Training-free video object editing via adaptive context enrichment.arXiv preprint arXiv:2509.17818, 2025

Chen, Y ., He, X., Ma, X., and Ma, Y . Contextflow: Training-free video object editing via adaptive context enrichment.arXiv preprint arXiv:2509.17818, 2025

work page arXiv 2025
[12]

Chen, B., Martí Monsó, D., Du, Y ., Simchowitz, M., Tedrake, R., and Sitzmann, V

Chung, H., Kim, J., Park, G. Y ., Nam, H., and Ye, J. C. Cfg++: Manifold-constrained classifier free guidance for diffusion models.arXiv preprint arXiv:2406.08070, 2024

work page arXiv 2024
[13]

Clark, K., Vicol, P., Swersky, K., and Fleet, D. J. Di- rectly fine-tuning diffusion models on differentiable rewards.arXiv preprint arXiv:2309.17400, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

and Nichol, A

Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021

work page 2021
[15]

Optimizing ddpm sampling with shortcut fine-tuning.arXiv preprint arXiv:2301.13362,

Fan, Y . and Lee, K. Optimizing ddpm sampling with shortcut fine-tuning.arXiv preprint arXiv:2301.13362, 2023

work page arXiv 2023
[16]

Dpok: Reinforcement learning for fine- tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858– 79885, 2023

Fan, Y ., Watkins, O., Du, Y ., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., and Lee, K. Dpok: Reinforcement learning for fine- tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858– 79885, 2023

work page 2023
[17]

Integrating extra modality helps segmentor find camouflaged objects well.arXiv preprint arXiv:2502.14471, 2025

Fang, C., He, C., Tang, L., Zhang, Y ., Zhu, C., Shen, Y ., Chen, C., Xu, G., and Li, X. Integrating extra modality helps segmentor find camouflaged objects well.arXiv preprint arXiv:2502.14471, 2025

work page arXiv 2025
[18]

PRISM: Rethinking Scattered Atmosphere Reconstruction as a Unified Understanding and Generation Model for Real-world Dehazing

Fang, C., He, C., Zhang, Y ., Chen, C., Zhu, C., Tang, L., and Li, X. Prism: Rethinking scattered atmosphere reconstruction as a unified understanding and gener- ation model for real-world dehazing.arXiv preprint arXiv:2604.07048, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Dit4edit: Diffusion transformer for image editing

Feng, K., Ma, Y ., Wang, B., Qi, C., Chen, H., Chen, Q., and Wang, Z. Dit4edit: Diffusion transformer for image editing. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 2969–2977, 2025

work page 2025
[20]

Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation.arXiv preprint arXiv:2507.11245, 2025

Feng, X., Yu, H., Wu, M., Hu, S., Chen, J., Zhu, C., Wu, J., Chu, X., and Huang, K. Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation.arXiv preprint arXiv:2507.11245, 2025

work page arXiv 2025
[21]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Gal, R., Alaluf, Y ., Atzmon, Y ., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[22]

Geneval: An object-focused framework for evaluating text-to- image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Ghosh, D., Hajishirzi, H., and Schmidt, L. Geneval: An object-focused framework for evaluating text-to- image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

work page 2023
[23]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek- r1: Incentivizing reasoning capability in llms via rein- forcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

When Less is More: The LLM Scaling Paradox in Context Compression

Guo, R., Liu, Y ., Ma, G., Wang, Y ., Zhang, Y ., Xia, L., Chen, K., Sun, Z., and Shi, D. When less is more: The llm scaling paradox in context compression.arXiv preprint arXiv:2602.09789, 2026. 9 E²PO: Embedding-perturbed Exploration Preference Optimization for Flow Models

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter

Gupta, S., Ahuja, C., Lin, T.-Y ., Roy, S. D., Oost- erhuis, H., de Rijke, M., and Shukla, S. N. A sim- ple and effective reinforcement learning method for text-to-image diffusion fine-tuning.arXiv preprint arXiv:2503.00897, 2025

work page arXiv 2025
[26]

Vigor-bench: How far are visual generative models from zero-shot visual reasoners?, 2026

Han, H., Huang, J., Sun, X., He, J., Yang, R., Hu, J., Peng, X., Ma, L., Wei, X., and Li, X. Vigor-bench: How far are visual generative models from zero-shot visual reasoners?, 2026. URL https://arxiv. org/abs/2603.25823

work page arXiv 2026
[27]

Camouflaged object detection with feature decomposition and edge reconstruction

He, C., Li, K., Zhang, Y ., Tang, L., Zhang, Y ., Guo, Z., and Li, X. Camouflaged object detection with feature decomposition and edge reconstruction. InCVPR, pp. 22046–22055, 2023

work page 2023
[28]

Strategic preys make acute predators: Enhancing camouflaged object detectors by generating camouflaged objects.ICLR, 2024

He, C., Li, K., Zhang, Y ., Zhang, Y ., Guo, Z., Li, X., Danelljan, M., and Yu, F. Strategic preys make acute predators: Enhancing camouflaged object detectors by generating camouflaged objects.ICLR, 2024

work page 2024
[29]

Reti-diff: Illumina- tion degradation image restoration with retinex-based latent diffusion model.ICLR, 2025

He, C., Fang, C., Zhang, Y ., Ye, T., Li, K., Tang, L., Guo, Z., Li, X., and Farsiu, S. Reti-diff: Illumina- tion degradation image restoration with retinex-based latent diffusion model.ICLR, 2025

work page 2025
[30]

Segment concealed object with incomplete supervision.TPAMI, 2025

He, C., Li, K., Zhang, Y ., Yang, Z., Tang, L., Zhang, Y ., Kong, L., and Farsiu, S. Segment concealed object with incomplete supervision.TPAMI, 2025

work page 2025
[31]

Diffusion models in low-level vision: A survey.TPAMI, 2025

He, C., Shen, Y ., Fang, C., Xiao, F., Tang, L., Zhang, Y ., Zuo, W., Guo, Z., and Li, X. Diffusion models in low-level vision: A survey.TPAMI, 2025

work page 2025
[32]

Reversible unfolding network for concealed visual perception with generative refinement.arXiv preprint arXiv:2508.15027, 2025

He, C., Xiao, F., Zhang, R., Fang, C., Fan, D.-P., and Farsiu, S. Reversible unfolding network for concealed visual perception with generative refinement.arXiv preprint arXiv:2508.15027, 2025

work page arXiv 2025
[33]

UnfoldLDM: Degradation-Aware Unfolding with Iterative Latent Diffusion Priors for Blind Image Restoration

He, C., Zhang, R., Chen, Z., Yang, B., Fang, C., Lin, Y ., Xiao, F., and Farsiu, S. Unfoldldm: Deep unfolding- based blind image restoration with latent diffusion priors.arXiv preprint arXiv:2511.18152, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Refining context-entangled content segmen- tation via curriculum selection and anti-curriculum promotion.ICML, 2026

He, C., Zhang, R., Xiao, F., Zhang, D., Cao, Z., and Farsiu, S. Refining context-entangled content segmen- tation via curriculum selection and anti-curriculum promotion.ICML, 2026

work page 2026
[35]

TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

He, X., Fu, S., Zhao, Y ., Li, W., Yang, J., Yin, D., Rao, F., and Zhang, B. Tempflow-grpo: When tim- ing matters for grpo in flow models.arXiv preprint arXiv:2508.04324, 2025

work page internal anchor Pith review arXiv 2025
[36]

Classifier-Free Diffusion Guidance

Ho, J. and Salimans, T. Classifier-free diffusion guid- ance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning.Advances in Neural Informa- tion Processing Systems, 38:138362–138383, 2026

Huang, J., Xu, Z., Zhou, J., Liu, T., Xiao, Y ., Ou, M., Ji, B., Li, X., and Yuan, K. Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning.Advances in Neural Informa- tion Processing Systems, 38:138362–138383, 2026

work page 2026
[38]

Mate: Images are all you need for material transfer via diffusion transformer

Huang, N., Liu, H., Lin, Y ., Huang, K., Chen, C., Guo, J., Lee, T.-y., and Li, X. Mate: Images are all you need for material transfer via diffusion transformer. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15117–15126, 2025

work page 2025
[39]

Tod3cap: Towards 3d dense captioning in outdoor scenes

Jin, B., Zheng, Y ., Li, P., Li, W., Zheng, Y ., Hu, S., Liu, X., Zhu, J., Yan, Z., Sun, H., et al. Tod3cap: Towards 3d dense captioning in outdoor scenes. In European Conference on Computer Vision, pp. 367–

work page
[40]

Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., and Levy, O. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems, 36:36652–36663, 2023.
[41]

Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
[42]

Lee, K., Liu, H., Ryu, M., Watkins, O., Du, Y., Boutilier, C., Abbeel, P., Ghavamzadeh, M., and Gu, S. S. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
[43]

Li, J., Cui, Y., Huang, T., Ma, Y., Fan, C., Yang, M., and Zhong, Z. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025.
[44]

Li, X., Liu, Y., Isobe, T., Jia, X., Cui, Q., Zhou, D., Li, D., He, Y., Lu, H., Wang, Z., et al. Reneg: Learning negative embedding with reward guidance. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 23636–23645, 2025.
[45]

Li, Y., Wang, Y., Zhu, Y., Zhao, Z., Lu, M., She, Q., and Zhang, S. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models. arXiv preprint arXiv:2509.06040, 2025.
[46]

Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. URL https://arxiv.org/abs/2210.02747
[48]

Liu, H., Huang, H., Wang, J., Liu, C., Li, X., and Ji, X. Diversegrpo: Mitigating mode collapse in image generation via diversity-aware grpo. arXiv preprint arXiv:2512.21514, 2025.
[49]

Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025.
[50]

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
[51]

Liu, Y., Hou, S., Hou, S., Du, J., Meng, S., and Huang, Y. Omnidiff: A comprehensive benchmark for fine-grained image difference captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 21440–21449, 2025.
[52]

Liu, Z., Xu, Z., Shu, S., Zhou, J., Zhang, R., Tang, Z., and Li, X. Controllable layer decomposition for reversible multi-layer image generation. arXiv preprint arXiv:2511.16249, 2025.
[53]

Long, Z., Zheng, M., Feng, K., Zhang, X., Liu, H., Yang, H., Zhang, L., Chen, Q., and Ma, Y. Follow-your-shape: Shape-aware image editing via trajectory-guided region control. arXiv preprint arXiv:2508.08134, 2025.
[54]

Ma, X., Qiu, H., Zhang, G., Zeng, Z., Yang, S., Ma, L., and Zhao, F. Stage: Stable and generalizable grpo for autoregressive image generation, 2025. URL https://arxiv.org/abs/2509.25027
[55]

Ma, X., Lei, J., Ren, T., Huang, J., Fu, S., Hao, A., Wu, J., Chu, X., and Zhao, F. Mar-grpo: Stabilized grpo for ar-diffusion hybrid image generation, 2026. URL https://arxiv.org/abs/2604.06966
[56]

Ma, Y., He, Y., Cun, X., Wang, X., Chen, S., Li, X., and Chen, Q. Follow your pose: Pose-guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 4117–4125, 2024.
[57]

Ma, Y., Liu, H., Wang, H., Pan, H., He, Y., Yuan, J., Zeng, A., Cai, C., Shum, H.-Y., Liu, W., et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. In SIGGRAPH Asia 2024 Conference Papers, pp. 1–12, 2024.
[58]

Ma, Y., Feng, K., Hu, Z., Wang, X., Wang, Y., Zheng, M., He, X., Zhu, C., Liu, H., He, Y., et al. Controllable video generation: A survey. arXiv preprint arXiv:2507.16869, 2025.
[59]

Ma, Y., Feng, K., Zhang, X., Liu, H., Zhang, D. J., Xing, J., Zhang, Y., Yang, A., Wang, Z., and Chen, Q. Follow-your-creation: Empowering 4d creation through video inpainting. arXiv preprint arXiv:2506.04590, 2025.
[60]

Ma, Y., He, Y., Wang, H., Wang, A., Shen, L., Qi, C., Ying, J., Cai, C., Li, Z., Shum, H.-Y., et al. Follow-your-click: Open-domain regional image animation via motion prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 6018–6026, 2025.
[61]

Ma, Y., Liu, Y., Zhu, Q., Yang, A., Feng, K., Zhang, X., Li, Z., Han, S., Qi, C., and Chen, Q. Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning. arXiv preprint arXiv:2506.05207, 2025.
[62]

Ma, Y., Yan, Z., Liu, H., Wang, H., Pan, H., He, Y., Yuan, J., Zeng, A., Cai, C., Shum, H.-Y., et al. Follow-your-emoji-faster: Towards efficient, fine-controllable, and expressive freestyle portrait animation. arXiv preprint arXiv:2509.16630, 2025.
[63]

Mao, F., Hao, A., Chen, J., Liu, D., Feng, X., Zhu, J., Wu, M., Chen, C., Wu, J., and Chu, X. Omni-effects: Unified and spatially-controllable visual effects generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pp. 7927–7935, 2026.
[64]

Meng, D., Jin, C., Gao, Z., Li, Y., Patras, I., and Tzimiropoulos, G. Training-free generation of diverse and high-fidelity images via prompt semantic space optimization, 2025. URL https://arxiv.org/abs/2511.19811
[65]

Miao, Z., Wang, J., Wang, Z., Yang, Z., Wang, L., Qiu, Q., and Liu, Z. Training diffusion models towards diverse image generation with reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10844–10853, 2024.
[66]

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[67]

Prabhudesai, M., Mendonca, R., Qin, Z., Fragkiadaki, K., and Pathak, D. Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737, 2024.
[68]

Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., and Zeng, M. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.
[69]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
[70]

Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22500–22510, 2023.
[71]

Shen, Y., Xiao, F., Hu, S., Pang, Y., Pu, Y., Fang, C., Li, X., and He, C. Uncertainty-masked bernoulli diffusion for camouflaged object detection refinement. arXiv preprint arXiv:2506.10712, 2025.
[72]

Shen, Y., Yuan, J., Aonishi, T., Nakayama, H., and Ma, Y. Follow-your-preference: Towards preference-aligned image inpainting. arXiv preprint arXiv:2509.23082, 2025.
[73]

Skalse, J., Howe, N., Krasheninnikov, D., and Krueger, D. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022.
[74]

Wang, J., Pu, J., Qi, Z., Guo, J., Ma, Y., Huang, N., Chen, Y., Li, X., and Shan, Y. Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746, 2024.
[75]

Wang, J., Liang, J., Liu, J., Liu, H., Liu, G., Zheng, J., Pang, W., Ma, A., Xie, Z., Wang, X., et al. Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv preprint arXiv:2510.22319, 2025.
[76]

Wang, J., Lai, Z., Chen, J., Guo, J., Guo, H., Li, X., Yue, X., and Guo, C. Elastic diffusion transformer. arXiv preprint arXiv:2602.13993, 2026.
[77]

Wang, J., Zhao, K., Guo, J., Wang, J., Guo, H., Zhu, C., Yue, X., and Li, X. Precisecache: Precise feature caching for efficient and high-fidelity video generation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=DjfRkr82jn
[78]

Wang, K., Mao, J., Wu, T., and Xiang, Y. Towards a golden classifier-free guidance path via foresight fixed point iterations. arXiv preprint arXiv:2510.21512, 2025.
[79]

Wang, R., Liu, T., Hsieh, C.-J., and Gong, B. On discrete prompt optimization for diffusion models. arXiv preprint arXiv:2407.01606, 2024.
[80]

Wang, Y., Li, Z., Zang, Y., Zhou, Y., Bu, J., Wang, C., Lu, Q., Jin, C., and Wang, J. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025.


[1]

Aesthetic predictor v2.5. https://github.com/discus0434/aesthetic-predictor-v2-5. Accessed: 2025-06-10.

[3]

Ban, Y., Wang, R., Zhou, T., Cheng, M., Gong, B., and Hsieh, C.-J. Understanding the impact of negative prompts: When and how do they take effect? In European Conference on Computer Vision, pp. 190–206. Springer, 2024.

[4]

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.

[5]

Black, K., Janner, M., Du, Y., Kostrikov, I., and Levine, S. Training diffusion models with reinforcement learning. In The Twelfth International Conference on Learning Representations.

[6]

Chen, C., Hu, S., Zhu, J., Wu, M., Chen, J., Li, Y., Huang, N., Fang, C., Wu, J., Chu, X., et al. Taming preference mode collapse via directional decoupling alignment in diffusion reinforcement learning. arXiv preprint arXiv:2512.24146, 2025.

[7]

Chen, C., Zhu, J., Feng, X., Huang, N., Zhu, C., Wu, M., Mao, F., Wu, J., Chu, X., and Li, X. Stochastic self-guidance for training-free enhancement of diffusion models. In The Fourteenth International Conference on Learning Representations, 2026.

[8]

Chen, C.-Y., Shi, M., Zhang, G., and Shi, H. T2i-copilot: A training-free multi-agent text-to-image system for enhanced prompt interpretation and interactive generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 19396–19405, October 2025.

[9]

Chen, H., Lou, X., Feng, X., Huang, K., and Wang, X. Unveiling chain of step reasoning for vision-language models with fine-grained rewards. Advances in Neural Information Processing Systems, 38:114703–114727, 2026.

[10]

Chen, J., Hao, A., Chen, X., Bai, C., Chen, C., Li, Y., Wu, J., Chu, X., and Zhang, S. Conceptweaver: Weaving disentangled concepts with flow. arXiv preprint arXiv:2603.28493, 2026.

[11]

Chen, Y., He, X., Ma, X., and Ma, Y. Contextflow: Training-free video object editing via adaptive context enrichment. arXiv preprint arXiv:2509.17818, 2025.

[12]

Chung, H., Kim, J., Park, G. Y., Nam, H., and Ye, J. C. Cfg++: Manifold-constrained classifier free guidance for diffusion models. arXiv preprint arXiv:2406.08070, 2024.

[13]

Clark, K., Vicol, P., Swersky, K., and Fleet, D. J. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023.

[14]

Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.

[15]

Fan, Y. and Lee, K. Optimizing ddpm sampling with shortcut fine-tuning. arXiv preprint arXiv:2301.13362, 2023.

[16]

Fan, Y., Watkins, O., Du, Y., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., and Lee, K. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36:79858–79885, 2023.

[17]

Fang, C., He, C., Tang, L., Zhang, Y., Zhu, C., Shen, Y., Chen, C., Xu, G., and Li, X. Integrating extra modality helps segmentor find camouflaged objects well. arXiv preprint arXiv:2502.14471, 2025.

[18]

Fang, C., He, C., Zhang, Y., Chen, C., Zhu, C., Tang, L., and Li, X. Prism: Rethinking scattered atmosphere reconstruction as a unified understanding and generation model for real-world dehazing. arXiv preprint arXiv:2604.07048, 2026.

[19]

Feng, K., Ma, Y., Wang, B., Qi, C., Chen, H., Chen, Q., and Wang, Z. Dit4edit: Diffusion transformer for image editing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 2969–2977, 2025.

[20]

Feng, X., Yu, H., Wu, M., Hu, S., Chen, J., Zhu, C., Wu, J., Chu, X., and Huang, K. Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation. arXiv preprint arXiv:2507.11245, 2025.

[21]

Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.

[22]

Ghosh, D., Hajishirzi, H., and Schmidt, L. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023.

[23]

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[24]

Guo, R., Liu, Y., Ma, G., Wang, Y., Zhang, Y., Xia, L., Chen, K., Sun, Z., and Shi, D. When less is more: The llm scaling paradox in context compression. arXiv preprint arXiv:2602.09789, 2026.

[25]

Gupta, S., Ahuja, C., Lin, T.-Y., Roy, S. D., Oosterhuis, H., de Rijke, M., and Shukla, S. N. A simple and effective reinforcement learning method for text-to-image diffusion fine-tuning. arXiv preprint arXiv:2503.00897, 2025.

[26]

Han, H., Huang, J., Sun, X., He, J., Yang, R., Hu, J., Peng, X., Ma, L., Wei, X., and Li, X. Vigor-bench: How far are visual generative models from zero-shot visual reasoners?, 2026. URL https://arxiv.org/abs/2603.25823

[27]

He, C., Li, K., Zhang, Y., Tang, L., Zhang, Y., Guo, Z., and Li, X. Camouflaged object detection with feature decomposition and edge reconstruction. In CVPR, pp. 22046–22055, 2023.

[28]

He, C., Li, K., Zhang, Y., Zhang, Y., Guo, Z., Li, X., Danelljan, M., and Yu, F. Strategic preys make acute predators: Enhancing camouflaged object detectors by generating camouflaged objects. ICLR, 2024.

[29]

He, C., Fang, C., Zhang, Y., Ye, T., Li, K., Tang, L., Guo, Z., Li, X., and Farsiu, S. Reti-diff: Illumination degradation image restoration with retinex-based latent diffusion model. ICLR, 2025.

[30]

He, C., Li, K., Zhang, Y., Yang, Z., Tang, L., Zhang, Y., Kong, L., and Farsiu, S. Segment concealed object with incomplete supervision. TPAMI, 2025.

[31]

He, C., Shen, Y., Fang, C., Xiao, F., Tang, L., Zhang, Y., Zuo, W., Guo, Z., and Li, X. Diffusion models in low-level vision: A survey. TPAMI, 2025.

[32]

He, C., Xiao, F., Zhang, R., Fang, C., Fan, D.-P., and Farsiu, S. Reversible unfolding network for concealed visual perception with generative refinement. arXiv preprint arXiv:2508.15027, 2025.

[33]

He, C., Zhang, R., Chen, Z., Yang, B., Fang, C., Lin, Y., Xiao, F., and Farsiu, S. Unfoldldm: Deep unfolding-based blind image restoration with latent diffusion priors. arXiv preprint arXiv:2511.18152, 2025.

[34] [34]

Refining context-entangled content segmen- tation via curriculum selection and anti-curriculum promotion.ICML, 2026

He, C., Zhang, R., Xiao, F., Zhang, D., Cao, Z., and Farsiu, S. Refining context-entangled content segmen- tation via curriculum selection and anti-curriculum promotion.ICML, 2026

work page 2026

[35] [35]

TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

He, X., Fu, S., Zhao, Y ., Li, W., Yang, J., Yin, D., Rao, F., and Zhang, B. Tempflow-grpo: When tim- ing matters for grpo in flow models.arXiv preprint arXiv:2508.04324, 2025

work page internal anchor Pith review arXiv 2025

[36] [36]

Classifier-Free Diffusion Guidance

Ho, J. and Salimans, T. Classifier-free diffusion guid- ance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[37] [37]

Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning.Advances in Neural Informa- tion Processing Systems, 38:138362–138383, 2026

Huang, J., Xu, Z., Zhou, J., Liu, T., Xiao, Y ., Ou, M., Ji, B., Li, X., and Yuan, K. Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning.Advances in Neural Informa- tion Processing Systems, 38:138362–138383, 2026

work page 2026

[38] [38]

Mate: Images are all you need for material transfer via diffusion transformer

Huang, N., Liu, H., Lin, Y ., Huang, K., Chen, C., Guo, J., Lee, T.-y., and Li, X. Mate: Images are all you need for material transfer via diffusion transformer. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15117–15126, 2025

work page 2025

[39] [39]

Tod3cap: Towards 3d dense captioning in outdoor scenes

Jin, B., Zheng, Y ., Li, P., Li, W., Zheng, Y ., Hu, S., Liu, X., Zhu, J., Yan, Z., Sun, H., et al. Tod3cap: Towards 3d dense captioning in outdoor scenes. In European Conference on Computer Vision, pp. 367–

work page

[40] [40]

Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652– 36663, 2023

Kirstain, Y ., Polyak, A., Singer, U., Matiana, S., Penna, J., and Levy, O. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652– 36663, 2023

work page 2023

[41] [41]

Crafting papers on machine learning

Langley, P. Crafting papers on machine learning. In Langley, P. (ed.),Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann

work page 2000

[42] [42]

Lee, K., Liu, H., Ryu, M., Watkins, O., Du, Y ., Boutilier, C., Abbeel, P., Ghavamzadeh, M., and Gu, S. S. Aligning text-to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Li, J., Cui, Y ., Huang, T., Ma, Y ., Fan, C., Yang, M., and Zhong, Z. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde.arXiv preprint arXiv:2507.21802, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

Reneg: Learning negative embedding with reward guidance

Li, X., Liu, Y ., Isobe, T., Jia, X., Cui, Q., Zhou, D., Li, D., He, Y ., Lu, H., Wang, Z., et al. Reneg: Learning negative embedding with reward guidance. InProceed- ings of the Computer Vision and Pattern Recognition Conference, pp. 23636–23645, 2025

work page 2025

[45] [45]

Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025

Li, Y ., Wang, Y ., Zhu, Y ., Zhao, Z., Lu, M., She, Q., and Zhang, S. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025

work page arXiv 2025

[46] [46]

Lipman, Y ., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling,

work page

[47] [47]

URL https://arxiv.org/abs/2210. 02747

work page

[48] [48]

Diversegrpo: Mitigating mode collapse in image generation via diversity-aware grpo.arXiv preprint arXiv:2512.21514, 2025

Liu, H., Huang, H., Wang, J., Liu, C., Li, X., and Ji, X. Diversegrpo: Mitigating mode collapse in image generation via diversity-aware grpo.arXiv preprint arXiv:2512.21514, 2025. 10 E²PO: Embedding-perturbed Exploration Preference Optimization for Flow Models

work page arXiv 2025

[49] [49]

Flow-GRPO: Training Flow Matching Models via Online RL

Liu, J., Liu, G., Liang, J., Li, Y ., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-grpo: Train- ing flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[51] [51]

Omnidiff: A comprehensive benchmark for fine- grained image difference captioning

Liu, Y ., Hou, S., Hou, S., Du, J., Meng, S., and Huang, Y . Omnidiff: A comprehensive benchmark for fine- grained image difference captioning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 21440–21449, 2025

work page 2025

[52] [52]

Controllable layer decomposition for reversible multi-layer image generation.arXiv preprint arXiv:2511.16249, 2025

Liu, Z., Xu, Z., Shu, S., Zhou, J., Zhang, R., Tang, Z., and Li, X. Controllable layer decomposition for reversible multi-layer image generation.arXiv preprint arXiv:2511.16249, 2025

work page arXiv 2025

[53] [53]

Follow-your-shape: Shape-aware image edit- ing via trajectory-guided region control.arXiv preprint arXiv:2508.08134, 2025

Long, Z., Zheng, M., Feng, K., Zhang, X., Liu, H., Yang, H., Zhang, L., Chen, Q., and Ma, Y . Follow-your-shape: Shape-aware image editing via trajectory-guided region control.arXiv preprint arXiv:2508.08134, 2025

work page arXiv 2025

[54] [54]

Stage: Stable and generalizable grpo for autoregressive image generation.arXiv preprint arXiv:2509.25027, 2025

Ma, X., Qiu, H., Zhang, G., Zeng, Z., Yang, S., Ma, L., and Zhao, F. Stage: Stable and generalizable grpo for autoregressive image generation, 2025. URL https: //arxiv.org/abs/2509.25027

work page arXiv 2025

[55] [55]

MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

Ma, X., Lei, J., Ren, T., Huang, J., Fu, S., Hao, A., Wu, J., Chu, X., and Zhao, F. Mar-grpo: Stabilized grpo for ar-diffusion hybrid image generation, 2026. URL https://arxiv.org/abs/2604.06966

work page internal anchor Pith review Pith/arXiv arXiv 2026

[56] [56]

Follow your pose: Pose-guided text-to- video generation using pose-free videos

Ma, Y ., He, Y ., Cun, X., Wang, X., Chen, S., Li, X., and Chen, Q. Follow your pose: Pose-guided text-to- video generation using pose-free videos. InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 4117–4125, 2024

work page 2024

[57] [57]

Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation

Ma, Y ., Liu, H., Wang, H., Pan, H., He, Y ., Yuan, J., Zeng, A., Cai, C., Shum, H.-Y ., Liu, W., et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. InSIGGRAPH Asia 2024 Conference Papers, pp. 1–12, 2024

work page 2024

[58] [58]

Controllable video generation: A survey.arXiv preprint arXiv:2507.16869,

Ma, Y ., Feng, K., Hu, Z., Wang, X., Wang, Y ., Zheng, M., He, X., Zhu, C., Liu, H., He, Y ., et al. Con- trollable video generation: A survey.arXiv preprint arXiv:2507.16869, 2025

work page arXiv 2025

[59] [59]

Follow-your-creation: Empowering 4d creation through video inpainting.arXiv preprint arXiv:2506.04590, 2025

Ma, Y ., Feng, K., Zhang, X., Liu, H., Zhang, D. J., Xing, J., Zhang, Y ., Yang, A., Wang, Z., and Chen, Q. Follow-your-creation: Empowering 4d creation through video inpainting.arXiv preprint arXiv:2506.04590, 2025

work page arXiv 2025

[60] [60]

Follow- your-click: Open-domain regional image animation via motion prompts

Ma, Y ., He, Y ., Wang, H., Wang, A., Shen, L., Qi, C., Ying, J., Cai, C., Li, Z., Shum, H.-Y ., et al. Follow- your-click: Open-domain regional image animation via motion prompts. InProceedings of the AAAI Con- ference on Artificial Intelligence, volume 39, pp. 6018– 6026, 2025

[61]

Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning

Ma, Y., Liu, Y., Zhu, Q., Yang, A., Feng, K., Zhang, X., Li, Z., Han, S., Qi, C., and Chen, Q. Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning. arXiv preprint arXiv:2506.05207, 2025

[62]

Follow-your-emoji-faster: Towards efficient, fine-controllable, and expressive freestyle portrait animation

Ma, Y., Yan, Z., Liu, H., Wang, H., Pan, H., He, Y., Yuan, J., Zeng, A., Cai, C., Shum, H.-Y., et al. Follow-your-emoji-faster: Towards efficient, fine-controllable, and expressive freestyle portrait animation. arXiv preprint arXiv:2509.16630, 2025

[63]

Omni-effects: Unified and spatially-controllable visual effects generation

Mao, F., Hao, A., Chen, J., Liu, D., Feng, X., Zhu, J., Wu, M., Chen, C., Wu, J., and Chu, X. Omni-effects: Unified and spatially-controllable visual effects generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pp. 7927–7935, 2026

[64]

Training-free generation of diverse and high-fidelity images via prompt semantic space optimization

Meng, D., Jin, C., Gao, Z., Li, Y., Patras, I., and Tzimiropoulos, G. Training-free generation of diverse and high-fidelity images via prompt semantic space optimization, 2025. URL https://arxiv.org/abs/2511.19811

[65]

Training diffusion models towards diverse image generation with reinforcement learning

Miao, Z., Wang, J., Wang, Z., Yang, Z., Wang, L., Qiu, Q., and Liu, Z. Training diffusion models towards diverse image generation with reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10844–10853, 2024

[66]

Training language models to follow instructions with human feedback

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

[67]

Video diffusion alignment via reward gradients

Prabhudesai, M., Mendonca, R., Qin, Z., Fragkiadaki, K., and Pathak, D. Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737, 2024

[68]

Automatic prompt optimization with "gradient descent" and beam search

Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., and Zeng, M. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023

[69]

High-resolution image synthesis with latent diffusion models

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022

[70]

DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation

Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22500–22510, 2023

[71]

Uncertainty-masked Bernoulli diffusion for camouflaged object detection refinement

Shen, Y., Xiao, F., Hu, S., Pang, Y., Pu, Y., Fang, C., Li, X., and He, C. Uncertainty-masked Bernoulli diffusion for camouflaged object detection refinement. arXiv preprint arXiv:2506.10712, 2025

[72]

Follow-your-preference: Towards preference-aligned image inpainting

Shen, Y., Yuan, J., Aonishi, T., Nakayama, H., and Ma, Y. Follow-your-preference: Towards preference-aligned image inpainting. arXiv preprint arXiv:2509.23082, 2025

[73]

Defining and characterizing reward gaming

Skalse, J., Howe, N., Krasheninnikov, D., and Krueger, D. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022

[74]

Taming rectified flow for inversion and editing

Wang, J., Pu, J., Qi, Z., Guo, J., Ma, Y., Huang, N., Chen, Y., Li, X., and Shan, Y. Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746, 2024

[75]

GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping

Wang, J., Liang, J., Liu, J., Liu, H., Liu, G., Zheng, J., Pang, W., Ma, A., Xie, Z., Wang, X., et al. GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv preprint arXiv:2510.22319, 2025

[76]

Elastic diffusion transformer

Wang, J., Lai, Z., Chen, J., Guo, J., Guo, H., Li, X., Yue, X., and Guo, C. Elastic diffusion transformer. arXiv preprint arXiv:2602.13993, 2026

[77]

Precisecache: Precise feature caching for efficient and high-fidelity video generation

Wang, J., Zhao, K., Guo, J., Wang, J., Guo, H., Zhu, C., Yue, X., and Li, X. Precisecache: Precise feature caching for efficient and high-fidelity video generation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=DjfRkr82jn

[78]

Towards a golden classifier-free guidance path via foresight fixed point iterations

Wang, K., Mao, J., Wu, T., and Xiang, Y. Towards a golden classifier-free guidance path via foresight fixed point iterations. arXiv preprint arXiv:2510.21512, 2025

[79]

On discrete prompt optimization for diffusion models

Wang, R., Liu, T., Hsieh, C.-J., and Gong, B. On discrete prompt optimization for diffusion models. arXiv preprint arXiv:2407.01606, 2024

[80]

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Wang, Y., Li, Z., Zang, Y., Zhou, Y., Bu, J., Wang, C., Lu, Q., Jin, C., and Wang, J. Pref-GRPO: Pairwise preference reward-based GRPO for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025
