
arxiv: 2605.15803 · v1 · pith:UGXFDZYJnew · submitted 2026-05-15 · 💻 cs.CV · cs.LG

Embedding-perturbed Exploration Preference Optimization for Flow Models

Pith reviewed 2026-05-20 18:28 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords preference optimization · flow models · embedding perturbation · reinforcement learning · variance maintenance · human alignment · generative models · exploration

The pith

Embedding-level perturbations within sample groups sustain variance and keep the learning signal alive during preference optimization for flow models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that group-based reinforcement learning methods for aligning generative models quickly lose intra-group differences, causing variance to collapse and removing the signal needed for stable optimization. To fix this, it introduces structured perturbations applied directly at the embedding stage inside each group of samples. These perturbations are designed to maintain useful variance without destroying the semantic meaning of the samples or the correctness of their preference labels. If successful, the approach allows continued learning throughout training instead of early stagnation or reward hacking. Experiments indicate the resulting flow models align more closely with human preferences than prior techniques.

Core claim

Embedding-perturbed Exploration Preference Optimization (E²PO) adds structured perturbations at the embedding level inside sample groups. This produces a sustained intra-group variance that preserves the discriminative signal required for optimization. The framework therefore avoids the variance collapse that occurs in standard group-based methods and yields flow models whose outputs match human preferences more faithfully than existing baselines.

What carries the argument

Embedding-level perturbation inside sample groups: a controlled addition of structured noise at the embedding stage that maintains variance while leaving semantic content and preference ordering unchanged.
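
To make the collapse concrete, here is a minimal sketch of the group-normalized advantage used by GRPO-style methods, assuming the common form (r_i - mean) / (std + eps). The function name, the reward values, and the `eps` constant are illustrative assumptions, not taken from the paper, and the sketch shows only the failure mode that E²PO targets, not its perturbation scheme.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for a group whose samples still differ.
diverse = [0.2, 0.5, 0.8, 0.5]
# Hypothetical rewards for a collapsed group: every sample scores the same.
collapsed = [0.7, 0.7, 0.7, 0.7]

adv_diverse = group_advantages(diverse)
adv_collapsed = group_advantages(collapsed)
```

With identical rewards every advantage is zero, so the group contributes no gradient signal. Any mechanism that keeps the rewards inside a group spread apart keeps these advantages non-zero.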

If this is right

  • Optimization remains stable because a non-zero discriminative signal persists even late in training.
  • Flow models reach higher human-preference alignment without requiring larger group sizes or repeated noise resampling.
  • The risk of premature policy stagnation or reward hacking is reduced.
  • The same perturbation principle offers a direct alternative to variance-increasing tricks that have shown diminishing returns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same embedding perturbation idea could be tested on diffusion or autoregressive models to check whether variance maintenance is architecture-specific.
  • Smaller group sizes might become viable if the perturbation reliably supplies the missing signal, lowering per-step compute.
  • Measuring output diversity on downstream tasks after training would reveal whether the added variance also improves sample variety.

Load-bearing premise

Perturbations added at the embedding level will increase useful variance without corrupting the semantic validity of the samples or the accuracy of their preference labels.

What would settle it

An experiment that applies the embedding perturbations and then finds either collapsed variance across groups or generated samples whose human preference rankings differ from the unperturbed versions would falsify the central claim.
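
One way such a check could be scored is a pairwise flip rate between preference scores collected before and after perturbation. The sketch below is an assumption about how to measure it, not the paper's protocol: the score lists are hypothetical and could come from human raters or a proxy preference model.

```python
from itertools import combinations


def pairwise_flip_rate(scores_orig, scores_pert):
    """Fraction of sample pairs whose preference order changes after perturbation."""
    if len(scores_orig) != len(scores_pert):
        raise ValueError("score lists must have the same length")
    flips = 0
    total = 0
    for i, j in combinations(range(len(scores_orig)), 2):
        before = scores_orig[i] - scores_orig[j]
        after = scores_pert[i] - scores_pert[j]
        if before == 0:
            continue  # an original tie expresses no preference to flip
        total += 1
        if before * after <= 0:
            flips += 1  # order reversed, or collapsed into a tie
    return flips / total if total else 0.0
```

A flip rate near zero alongside sustained intra-group variance would support the premise. A high flip rate, or variance that still collapses, would count against it.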

Figures

Figures reproduced from arXiv: 2605.15803 by Chubin Chen, Jiahong Wu, Jiashu Zhu, Sujie Hu, Xiangxiang Chu, Xiu Li.

Figure 1. (Top) Unlike baselines that suffer from Vanishing Discriminative Signal, E²PO sustains Discriminative Variance through embedding-level perturbation. (Bottom) By injecting structured perturbations into the semantic manifold (Center), our method significantly expands the exploration space (Right) and achieves superior reward alignment (Left) compared to state-of-the-art methods.
Figure 2. Evolution of Intra-Group Discriminative Variance During Training. Smoothed curves track the trend of the variance statistics as training proceeds. The baseline's standard deviation declines significantly, whereas E²PO maintains a consistent variance level, demonstrating sufficient intra-group discriminative variance throughout the optimization process.
Figure 3. Method Overview. We introduce (a) selective perturbation on content embeddings to induce discriminative signal, (b) a noise-aware schedule to modulate condition injection during sampling, and (c) a reference-anchored strategy that calculates gradients relative to the original prompt C_orig to prevent semantic drift.
Figure 4. Qualitative Comparison of E²PO against Baselines. E²PO demonstrates superior performance in fidelity, spatial reasoning, instruction adherence, and diversity, overcoming the limitations (highlighted in red circles) seen in other methods.
Figure 5. Ablation of Latent Group Size G and Number of Semantic Variants K. We analyze the trade-off between G and K under a fixed computational budget (N = G × K). We observe that the extreme configurations (G = 1 or K = 1) are insufficient, whereas a balanced split between G and K achieves the most stable and highest performance.
Figure 6. Ablation of Noise-Aware Sampling Schedule. Samples are generated at the 150th training step. The static strategy (right) leads to semantic drift or artifacts, whereas our Noise-Aware Schedule (left) maintains high visual fidelity.
Figure 8. Human Preference Evaluation Results. We compare E²PO against SD3.5-M and RL-based baselines (DanceGRPO, Flow-GRPO, DiffusionNFT) across four key dimensions: Detail Preservation, Color Consistency, Image-Text Alignment, and Overall Quality. The results demonstrate that our method consistently achieves the highest user preference rates across all categories.
Figure 9. Ablation of Perturbation Scope. We compare perturbing the primary encoder (CLIP-L) against perturbing both encoders. The plot shows that limiting perturbation to CLIP-L ensures high-quality generation, while the dual-encoder strategy suffers from significant performance degradation and instability.
Figure 10. Qualitative Comparison of E²PO against SOTAs on the GenEval Benchmark. All RL-based methods are trained using GenEval as the reward model.
Figure 11. Qualitative Comparison of E²PO against SOTAs on the PickScore Benchmark. All RL-based methods are trained using PickScore as the reward model.
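
The fixed-budget trade-off described in the Figure 5 caption (N = G × K) can be enumerated directly. This is a small sketch under stated assumptions: the budget value 24 is hypothetical, since the caption does not report N, and reading "balanced" as the smallest |G − K| is an editorial interpretation rather than the paper's definition.

```python
def budget_splits(n):
    """All (G, K) pairs of positive integers with G * K == n."""
    return [(g, n // g) for g in range(1, n + 1) if n % g == 0]


def most_balanced(n):
    """The split with the smallest |G - K|; the first such pair wins ties."""
    return min(budget_splits(n), key=lambda pair: abs(pair[0] - pair[1]))


# Hypothetical budget of 24 samples per prompt.
splits = budget_splits(24)
balanced = most_balanced(24)
```

The first and last entries of `splits` are the extreme configurations (G = 1 or K = 1) that the caption reports as insufficient.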

Original abstract

Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization (E²PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Embedding-perturbed Exploration Preference Optimization (E²PO) for flow models. It identifies rapid decay of intra-group variance as a core limitation in group-based RL frameworks such as GRPO, which eliminates the learning signal and leads to instability or reward hacking. The method introduces structured perturbations at the embedding level within sample groups to sustain variance while preserving the discriminative signal and semantic validity. Experiments are claimed to show significant outperformance over state-of-the-art baselines with more faithful human-preference alignment.

Significance. If the embedding perturbations can be shown to increase useful variance without corrupting preference labels, the approach would offer a targeted engineering fix for a known instability in preference optimization of generative models. This could improve training stability for flow-based architectures without requiring larger groups or noise variation, provided the invariance property holds.

major comments (1)
  1. [Abstract and §3] Abstract and §3 (Method): The central claim that 'structured, embedding-level perturbations ... guaranteeing a robust variance that preserves the discriminative signal' requires that the perturbation operator leaves both semantic content and the correctness of human preference labels unchanged. No derivation, bound, or invariance argument is supplied showing that the perturbation commutes with the preference oracle or that embedding shifts do not cross decision boundaries corresponding to preference flips. This is load-bearing for the claim, as label corruption would turn the optimization objective into a misaligned surrogate.
minor comments (1)
  1. [Abstract] The abstract supplies no equations, implementation details, or quantitative metrics, which hinders immediate technical assessment even though this is acceptable for an abstract.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and detailed review of our manuscript. The major comment raises an important point about the need for justification that our embedding perturbations preserve semantic content and preference labels. We address this below and outline the changes we will make in revision.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Method): The central claim that 'structured, embedding-level perturbations ... guaranteeing a robust variance that preserves the discriminative signal' requires that the perturbation operator leaves both semantic content and the correctness of human preference labels unchanged. No derivation, bound, or invariance argument is supplied showing that the perturbation commutes with the preference oracle or that embedding shifts do not cross decision boundaries corresponding to preference flips. This is load-bearing for the claim, as label corruption would turn the optimization objective into a misaligned surrogate.

    Authors: We agree that establishing preservation of semantic content and preference labels is central to the validity of E²PO. The current manuscript motivates the perturbations as small and structured within the embedding space of a fixed pre-trained encoder, with the claim supported by downstream empirical results showing improved alignment and stability. However, we acknowledge the absence of an explicit invariance argument or bound. In the revised manuscript we will add a new subsection in §3 that (i) provides a heuristic argument based on the local Lipschitz continuity of the embedding map and the small magnitude of the perturbations, (ii) reports an empirical label-consistency study in which human raters or a proxy preference model re-evaluate perturbed versus unperturbed pairs, and (iii) discusses the operating regime in which decision-boundary crossings are unlikely. These additions will make the load-bearing assumption explicit and testable. revision: yes
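
The heuristic argument promised in point (i) could take roughly the following shape. This is an editorial sketch, not the authors' derivation: the score function $s$, the constant $L$, the margin $m$, and the bound $\varepsilon$ are all assumed symbols, and the sketch places the Lipschitz assumption on the preference score as a function of the conditioning embedding, which is stronger than what the rebuttal states.

```latex
% Assumption: the preference score s is locally L-Lipschitz in the embedding e.
\[
  |s(e + \delta) - s(e)| \le L \,\|\delta\|
  \quad \text{for all } \|\delta\| \le \varepsilon .
\]
% Two samples a, b with original margin m = s(e_a) - s(e_b) > 0
% and perturbations bounded by \varepsilon:
\[
  s(e_a + \delta_a) - s(e_b + \delta_b)
  \;\ge\; m - L\|\delta_a\| - L\|\delta_b\|
  \;\ge\; m - 2L\varepsilon .
\]
% Hence the preference order is preserved whenever
\[
  \varepsilon < \frac{m}{2L} .
\]
```

Under these assumptions, pairs with a small original margin are the ones at risk of a preference flip, which is where an empirical label-consistency study would be most informative.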

Circularity Check

0 steps flagged

No significant circularity; method presented as independent engineering intervention without reductive derivations

full rationale

The abstract and available text introduce E²PO as a novel framework that adds structured embedding-level perturbations to sustain intra-group variance in group-based RL optimization for flow models. No equations, derivations, or parameter-fitting steps are shown that reduce a claimed prediction or result back to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The variance-preservation benefit is asserted directly from the perturbation design rather than derived from fitted quantities or prior self-referential results. This is a standard non-circular engineering proposal; the derivation chain (if any exists in the full manuscript) does not exhibit the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested assumption that embedding perturbations preserve semantic validity and on at least one tunable perturbation strength hyperparameter whose value is not reported.

free parameters (1)
  • perturbation magnitude
    Controls how strongly embeddings are altered; must be chosen or fitted to keep variance high without harming sample quality.
axioms (1)
  • domain assumption: Structured embedding perturbations increase intra-group variance while leaving preference signals intact.
    This premise is required for the method to deliver a usable learning signal rather than noise or bias.

pith-pipeline@v0.9.0 · 5717 in / 1145 out tokens · 50153 ms · 2026-05-20T18:28:19.375871+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.

Reference graph

Works this paper leans on

101 extracted references · 101 canonical work pages · cited by 1 Pith paper · 19 internal anchors

  1. [1]

    https://github.com/ discus0434/aesthetic-predictor-v2-5 ,

    Aesthetic predictor v2.5. https://github.com/ discus0434/aesthetic-predictor-v2-5 ,

  2. [2]

    Accessed: 2025-06-10

  3. [3]

    Ban, Y ., Wang, R., Zhou, T., Cheng, M., Gong, B., and Hsieh, C.-J. Understanding the impact of negative prompts: When and how do they take effect? In 8 E²PO: Embedding-perturbed Exploration Preference Optimization for Flow Models european conference on computer vision, pp. 190–206. Springer, 2024

  4. [4]

    Representa- tion learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelli- gence, 35(8):1798–1828, 2013

    Bengio, Y ., Courville, A., and Vincent, P. Representa- tion learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelli- gence, 35(8):1798–1828, 2013

  5. [5]

    Training diffusion models with reinforcement learn- ing

    Black, K., Janner, M., Du, Y ., Kostrikov, I., and Levine, S. Training diffusion models with reinforcement learn- ing. InThe Twelfth International Conference on Learn- ing Representations

  6. [6]

    Taming preference mode collapse via directional decoupling alignment in diffusion reinforcement learning.arXiv preprint arXiv:2512.24146, 2025

    Chen, C., Hu, S., Zhu, J., Wu, M., Chen, J., Li, Y ., Huang, N., Fang, C., Wu, J., Chu, X., et al. Taming preference mode collapse via directional decoupling alignment in diffusion reinforcement learning.arXiv preprint arXiv:2512.24146, 2025

  7. [7]

    Stochastic self- guidance for training-free enhancement of diffusion models

    Chen, C., Zhu, J., Feng, X., Huang, N., Zhu, C., Wu, M., Mao, F., Wu, J., Chu, X., and Li, X. Stochastic self- guidance for training-free enhancement of diffusion models. InThe Fourteenth International Conference on Learning Representations, 2026

  8. [8]

    T2i- copilot: A training-free multi-agent text-to-image sys- tem for enhanced prompt interpretation and interactive generation

    Chen, C.-Y ., Shi, M., Zhang, G., and Shi, H. T2i- copilot: A training-free multi-agent text-to-image sys- tem for enhanced prompt interpretation and interactive generation. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision (ICCV), pp. 19396–19405, October 2025

  9. [9]

    Unveiling chain of step reasoning for vision-language models with fine-grained rewards.Advances in Neural Information Processing Systems, 38:114703–114727, 2026

    Chen, H., Lou, X., Feng, X., Huang, K., and Wang, X. Unveiling chain of step reasoning for vision-language models with fine-grained rewards.Advances in Neural Information Processing Systems, 38:114703–114727, 2026

  10. [10]

    Conceptweaver: Weav- ing disentangled concepts with flow.arXiv preprint arXiv:2603.28493, 2026

    Chen, J., Hao, A., Chen, X., Bai, C., Chen, C., Li, Y ., Wu, J., Chu, X., and Zhang, S. Conceptweaver: Weav- ing disentangled concepts with flow.arXiv preprint arXiv:2603.28493, 2026

  11. [11]

    Contextflow: Training-free video object editing via adaptive context enrichment.arXiv preprint arXiv:2509.17818, 2025

    Chen, Y ., He, X., Ma, X., and Ma, Y . Contextflow: Training-free video object editing via adaptive context enrichment.arXiv preprint arXiv:2509.17818, 2025

  12. [12]

    Chen, B., Martí Monsó, D., Du, Y ., Simchowitz, M., Tedrake, R., and Sitzmann, V

    Chung, H., Kim, J., Park, G. Y ., Nam, H., and Ye, J. C. Cfg++: Manifold-constrained classifier free guidance for diffusion models.arXiv preprint arXiv:2406.08070, 2024

  13. [13]

    Clark, K., Vicol, P., Swersky, K., and Fleet, D. J. Di- rectly fine-tuning diffusion models on differentiable rewards.arXiv preprint arXiv:2309.17400, 2023

  14. [14]

    and Nichol, A

    Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021

  15. [15]

    Optimizing ddpm sampling with shortcut fine-tuning.arXiv preprint arXiv:2301.13362,

    Fan, Y . and Lee, K. Optimizing ddpm sampling with shortcut fine-tuning.arXiv preprint arXiv:2301.13362, 2023

  16. [16]

    Dpok: Reinforcement learning for fine- tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858– 79885, 2023

    Fan, Y ., Watkins, O., Du, Y ., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., and Lee, K. Dpok: Reinforcement learning for fine- tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858– 79885, 2023

  17. [17]

    Integrating extra modality helps segmentor find camouflaged objects well.arXiv preprint arXiv:2502.14471, 2025

    Fang, C., He, C., Tang, L., Zhang, Y ., Zhu, C., Shen, Y ., Chen, C., Xu, G., and Li, X. Integrating extra modality helps segmentor find camouflaged objects well.arXiv preprint arXiv:2502.14471, 2025

  18. [18]

    PRISM: Rethinking Scattered Atmosphere Reconstruction as a Unified Understanding and Generation Model for Real-world Dehazing

    Fang, C., He, C., Zhang, Y ., Chen, C., Zhu, C., Tang, L., and Li, X. Prism: Rethinking scattered atmosphere reconstruction as a unified understanding and gener- ation model for real-world dehazing.arXiv preprint arXiv:2604.07048, 2026

  19. [19]

    Dit4edit: Diffusion transformer for image editing

    Feng, K., Ma, Y ., Wang, B., Qi, C., Chen, H., Chen, Q., and Wang, Z. Dit4edit: Diffusion transformer for image editing. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 2969–2977, 2025

  20. [20]

    Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation.arXiv preprint arXiv:2507.11245, 2025

    Feng, X., Yu, H., Wu, M., Hu, S., Chen, J., Zhu, C., Wu, J., Chu, X., and Huang, K. Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation.arXiv preprint arXiv:2507.11245, 2025

  21. [21]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Gal, R., Alaluf, Y ., Atzmon, Y ., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022

  22. [22]

    Geneval: An object-focused framework for evaluating text-to- image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

    Ghosh, D., Hajishirzi, H., and Schmidt, L. Geneval: An object-focused framework for evaluating text-to- image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  23. [23]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek- r1: Incentivizing reasoning capability in llms via rein- forcement learning.arXiv preprint arXiv:2501.12948, 2025

  24. [24]

    When Less is More: The LLM Scaling Paradox in Context Compression

    Guo, R., Liu, Y ., Ma, G., Wang, Y ., Zhang, Y ., Xia, L., Chen, K., Sun, Z., and Shi, D. When less is more: The llm scaling paradox in context compression.arXiv preprint arXiv:2602.09789, 2026. 9 E²PO: Embedding-perturbed Exploration Preference Optimization for Flow Models

  25. [25]

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter

    Gupta, S., Ahuja, C., Lin, T.-Y ., Roy, S. D., Oost- erhuis, H., de Rijke, M., and Shukla, S. N. A sim- ple and effective reinforcement learning method for text-to-image diffusion fine-tuning.arXiv preprint arXiv:2503.00897, 2025

  26. [26]

    Vigor-bench: How far are visual generative models from zero-shot visual reasoners?, 2026

    Han, H., Huang, J., Sun, X., He, J., Yang, R., Hu, J., Peng, X., Ma, L., Wei, X., and Li, X. Vigor-bench: How far are visual generative models from zero-shot visual reasoners?, 2026. URL https://arxiv. org/abs/2603.25823

  27. [27]

    Camouflaged object detection with feature decomposition and edge reconstruction

    He, C., Li, K., Zhang, Y ., Tang, L., Zhang, Y ., Guo, Z., and Li, X. Camouflaged object detection with feature decomposition and edge reconstruction. InCVPR, pp. 22046–22055, 2023

  28. [28]

    Strategic preys make acute predators: Enhancing camouflaged object detectors by generating camouflaged objects.ICLR, 2024

    He, C., Li, K., Zhang, Y ., Zhang, Y ., Guo, Z., Li, X., Danelljan, M., and Yu, F. Strategic preys make acute predators: Enhancing camouflaged object detectors by generating camouflaged objects.ICLR, 2024

  29. [29]

    Reti-diff: Illumina- tion degradation image restoration with retinex-based latent diffusion model.ICLR, 2025

    He, C., Fang, C., Zhang, Y ., Ye, T., Li, K., Tang, L., Guo, Z., Li, X., and Farsiu, S. Reti-diff: Illumina- tion degradation image restoration with retinex-based latent diffusion model.ICLR, 2025

  30. [30]

    Segment concealed object with incomplete supervision.TPAMI, 2025

    He, C., Li, K., Zhang, Y ., Yang, Z., Tang, L., Zhang, Y ., Kong, L., and Farsiu, S. Segment concealed object with incomplete supervision.TPAMI, 2025

  31. [31]

    Diffusion models in low-level vision: A survey.TPAMI, 2025

    He, C., Shen, Y ., Fang, C., Xiao, F., Tang, L., Zhang, Y ., Zuo, W., Guo, Z., and Li, X. Diffusion models in low-level vision: A survey.TPAMI, 2025

  32. [32]

    Reversible unfolding network for concealed visual perception with generative refinement.arXiv preprint arXiv:2508.15027, 2025

    He, C., Xiao, F., Zhang, R., Fang, C., Fan, D.-P., and Farsiu, S. Reversible unfolding network for concealed visual perception with generative refinement.arXiv preprint arXiv:2508.15027, 2025

  33. [33]

    UnfoldLDM: Degradation-Aware Unfolding with Iterative Latent Diffusion Priors for Blind Image Restoration

    He, C., Zhang, R., Chen, Z., Yang, B., Fang, C., Lin, Y ., Xiao, F., and Farsiu, S. Unfoldldm: Deep unfolding- based blind image restoration with latent diffusion priors.arXiv preprint arXiv:2511.18152, 2025

  34. [34]

    Refining context-entangled content segmen- tation via curriculum selection and anti-curriculum promotion.ICML, 2026

    He, C., Zhang, R., Xiao, F., Zhang, D., Cao, Z., and Farsiu, S. Refining context-entangled content segmen- tation via curriculum selection and anti-curriculum promotion.ICML, 2026

  35. [35]

    TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

    He, X., Fu, S., Zhao, Y ., Li, W., Yang, J., Yin, D., Rao, F., and Zhang, B. Tempflow-grpo: When tim- ing matters for grpo in flow models.arXiv preprint arXiv:2508.04324, 2025

  36. [36]

    Classifier-Free Diffusion Guidance

    Ho, J. and Salimans, T. Classifier-free diffusion guid- ance.arXiv preprint arXiv:2207.12598, 2022

  37. [37]

    Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning.Advances in Neural Informa- tion Processing Systems, 38:138362–138383, 2026

    Huang, J., Xu, Z., Zhou, J., Liu, T., Xiao, Y ., Ou, M., Ji, B., Li, X., and Yuan, K. Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning.Advances in Neural Informa- tion Processing Systems, 38:138362–138383, 2026

  38. [38]

    Mate: Images are all you need for material transfer via diffusion transformer

    Huang, N., Liu, H., Lin, Y ., Huang, K., Chen, C., Guo, J., Lee, T.-y., and Li, X. Mate: Images are all you need for material transfer via diffusion transformer. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15117–15126, 2025

  39. [39]

    Tod3cap: Towards 3d dense captioning in outdoor scenes

    Jin, B., Zheng, Y ., Li, P., Li, W., Zheng, Y ., Hu, S., Liu, X., Zhu, J., Yan, Z., Sun, H., et al. Tod3cap: Towards 3d dense captioning in outdoor scenes. In European Conference on Computer Vision, pp. 367–

  40. [40]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652– 36663, 2023

    Kirstain, Y ., Polyak, A., Singer, U., Matiana, S., Penna, J., and Levy, O. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652– 36663, 2023

  41. [41]

    Crafting papers on machine learning

    Langley, P. Crafting papers on machine learning. In Langley, P. (ed.),Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann

  42. [42]

    Lee, K., Liu, H., Ryu, M., Watkins, O., Du, Y ., Boutilier, C., Abbeel, P., Ghavamzadeh, M., and Gu, S. S. Aligning text-to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023

  43. [43]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    Li, J., Cui, Y ., Huang, T., Ma, Y ., Fan, C., Yang, M., and Zhong, Z. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde.arXiv preprint arXiv:2507.21802, 2025

  44. [44]

    Reneg: Learning negative embedding with reward guidance

    Li, X., Liu, Y ., Isobe, T., Jia, X., Cui, Q., Zhou, D., Li, D., He, Y ., Lu, H., Wang, Z., et al. Reneg: Learning negative embedding with reward guidance. InProceed- ings of the Computer Vision and Pattern Recognition Conference, pp. 23636–23645, 2025

  45. [45]

    Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025

    Li, Y ., Wang, Y ., Zhu, Y ., Zhao, Z., Lu, M., She, Q., and Zhang, S. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025

  46. [46]

    Lipman, Y ., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling,

  47. [47]

    URL https://arxiv.org/abs/2210. 02747

  48. [48]

    Diversegrpo: Mitigating mode collapse in image generation via diversity-aware grpo.arXiv preprint arXiv:2512.21514, 2025

    Liu, H., Huang, H., Wang, J., Liu, C., Li, X., and Ji, X. Diversegrpo: Mitigating mode collapse in image generation via diversity-aware grpo.arXiv preprint arXiv:2512.21514, 2025. 10 E²PO: Embedding-perturbed Exploration Preference Optimization for Flow Models

  49. [49]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Liu, J., Liu, G., Liang, J., Li, Y ., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-grpo: Train- ing flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

  50. [50]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  51. [51]

    Omnidiff: A comprehensive benchmark for fine- grained image difference captioning

    Liu, Y ., Hou, S., Hou, S., Du, J., Meng, S., and Huang, Y . Omnidiff: A comprehensive benchmark for fine- grained image difference captioning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 21440–21449, 2025

  52. [52]

    Controllable layer decomposition for reversible multi-layer image generation.arXiv preprint arXiv:2511.16249, 2025

    Liu, Z., Xu, Z., Shu, S., Zhou, J., Zhang, R., Tang, Z., and Li, X. Controllable layer decomposition for reversible multi-layer image generation.arXiv preprint arXiv:2511.16249, 2025

  53. [53]

    Follow-your-shape: Shape-aware image editing via trajectory-guided region control. arXiv preprint arXiv:2508.08134, 2025

    Long, Z., Zheng, M., Feng, K., Zhang, X., Liu, H., Yang, H., Zhang, L., Chen, Q., and Ma, Y. Follow-your-shape: Shape-aware image editing via trajectory-guided region control. arXiv preprint arXiv:2508.08134, 2025

  54. [54]

    Stage: Stable and generalizable grpo for autoregressive image generation. arXiv preprint arXiv:2509.25027, 2025

    Ma, X., Qiu, H., Zhang, G., Zeng, Z., Yang, S., Ma, L., and Zhao, F. Stage: Stable and generalizable grpo for autoregressive image generation, 2025. URL https://arxiv.org/abs/2509.25027

  55. [55]

    MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

    Ma, X., Lei, J., Ren, T., Huang, J., Fu, S., Hao, A., Wu, J., Chu, X., and Zhao, F. Mar-grpo: Stabilized grpo for ar-diffusion hybrid image generation, 2026. URL https://arxiv.org/abs/2604.06966

  56. [56]

    Follow your pose: Pose-guided text-to-video generation using pose-free videos

    Ma, Y., He, Y., Cun, X., Wang, X., Chen, S., Li, X., and Chen, Q. Follow your pose: Pose-guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 4117–4125, 2024

  57. [57]

    Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation

    Ma, Y., Liu, H., Wang, H., Pan, H., He, Y., Yuan, J., Zeng, A., Cai, C., Shum, H.-Y., Liu, W., et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. In SIGGRAPH Asia 2024 Conference Papers, pp. 1–12, 2024

  58. [58]

    Controllable video generation: A survey. arXiv preprint arXiv:2507.16869, 2025

    Ma, Y., Feng, K., Hu, Z., Wang, X., Wang, Y., Zheng, M., He, X., Zhu, C., Liu, H., He, Y., et al. Controllable video generation: A survey. arXiv preprint arXiv:2507.16869, 2025

  59. [59]

    Follow-your-creation: Empowering 4d creation through video inpainting. arXiv preprint arXiv:2506.04590, 2025

    Ma, Y., Feng, K., Zhang, X., Liu, H., Zhang, D. J., Xing, J., Zhang, Y., Yang, A., Wang, Z., and Chen, Q. Follow-your-creation: Empowering 4d creation through video inpainting. arXiv preprint arXiv:2506.04590, 2025

  60. [60]

    Follow-your-click: Open-domain regional image animation via motion prompts

    Ma, Y., He, Y., Wang, H., Wang, A., Shen, L., Qi, C., Ying, J., Cai, C., Li, Z., Shum, H.-Y., et al. Follow-your-click: Open-domain regional image animation via motion prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 6018–6026, 2025

  61. [61]

    Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning. arXiv preprint arXiv:2506.05207, 2025

    Ma, Y., Liu, Y., Zhu, Q., Yang, A., Feng, K., Zhang, X., Li, Z., Han, S., Qi, C., and Chen, Q. Follow-your-motion: Video motion transfer via efficient spatial-temporal decoupled finetuning. arXiv preprint arXiv:2506.05207, 2025

  62. [62]

    Follow-your-emoji-faster: Towards efficient, fine-controllable, and expressive freestyle portrait animation. arXiv preprint arXiv:2509.16630, 2025

    Ma, Y., Yan, Z., Liu, H., Wang, H., Pan, H., He, Y., Yuan, J., Zeng, A., Cai, C., Shum, H.-Y., et al. Follow-your-emoji-faster: Towards efficient, fine-controllable, and expressive freestyle portrait animation. arXiv preprint arXiv:2509.16630, 2025

  63. [63]

    Omni-effects: Unified and spatially-controllable visual effects generation

    Mao, F., Hao, A., Chen, J., Liu, D., Feng, X., Zhu, J., Wu, M., Chen, C., Wu, J., and Chu, X. Omni-effects: Unified and spatially-controllable visual effects generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pp. 7927–7935, 2026

  64. [64]

    Training-free generation of diverse and high-fidelity images via prompt semantic space optimization, 2025

    Meng, D., Jin, C., Gao, Z., Li, Y., Patras, I., and Tzimiropoulos, G. Training-free generation of diverse and high-fidelity images via prompt semantic space optimization, 2025. URL https://arxiv.org/abs/2511.19811

  65. [65]

    Training diffusion models towards diverse image generation with reinforcement learning

    Miao, Z., Wang, J., Wang, Z., Yang, Z., Wang, L., Qiu, Q., and Liu, Z. Training diffusion models towards diverse image generation with reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10844–10853, 2024

  66. [66]

    Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  67. [67]

    Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737, 2024

    Prabhudesai, M., Mendonca, R., Qin, Z., Fragkiadaki, K., and Pathak, D. Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737, 2024

  68. [68]

    Automatic prompt optimization with "gradient descent" and beam search

    Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., and Zeng, M. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023

  69. [69]

    High-resolution image synthesis with latent diffusion models

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022

  70. [70]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22500–22510, 2023

  71. [71]

    Uncertainty-masked bernoulli diffusion for camouflaged object detection refinement. arXiv preprint arXiv:2506.10712, 2025

    Shen, Y., Xiao, F., Hu, S., Pang, Y., Pu, Y., Fang, C., Li, X., and He, C. Uncertainty-masked bernoulli diffusion for camouflaged object detection refinement. arXiv preprint arXiv:2506.10712, 2025

  72. [72]

    Follow-your-preference: Towards preference-aligned image inpainting. arXiv preprint arXiv:2509.23082, 2025

    Shen, Y., Yuan, J., Aonishi, T., Nakayama, H., and Ma, Y. Follow-your-preference: Towards preference-aligned image inpainting. arXiv preprint arXiv:2509.23082, 2025

  73. [73]

    Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022

    Skalse, J., Howe, N., Krasheninnikov, D., and Krueger, D. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022

  74. [74]

    Taming rectified flow for inversion and editing

    Wang, J., Pu, J., Qi, Z., Guo, J., Ma, Y., Huang, N., Chen, Y., Li, X., and Shan, Y. Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746, 2024

  75. [75]

    Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv preprint arXiv:2510.22319, 2025a

    Wang, J., Liang, J., Liu, J., Liu, H., Liu, G., Zheng, J., Pang, W., Ma, A., Xie, Z., Wang, X., et al. Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv preprint arXiv:2510.22319, 2025

  76. [76]

    Elastic diffusion transformer

    Wang, J., Lai, Z., Chen, J., Guo, J., Guo, H., Li, X., Yue, X., and Guo, C. Elastic diffusion transformer. arXiv preprint arXiv:2602.13993, 2026

  77. [77]

    Precisecache: Precise feature caching for efficient and high-fidelity video generation

    Wang, J., Zhao, K., Guo, J., Wang, J., Guo, H., Zhu, C., Yue, X., and Li, X. Precisecache: Precise feature caching for efficient and high-fidelity video generation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=DjfRkr82jn

  78. [78]

    Towards a golden classifier-free guidance path via foresight fixed point iterations. arXiv preprint arXiv:2510.21512, 2025

    Wang, K., Mao, J., Wu, T., and Xiang, Y. Towards a golden classifier-free guidance path via foresight fixed point iterations. arXiv preprint arXiv:2510.21512, 2025

  79. [79]

    On discrete prompt optimization for diffusion models. arXiv preprint arXiv:2407.01606, 2024

    Wang, R., Liu, T., Hsieh, C.-J., and Gong, B. On discrete prompt optimization for diffusion models. arXiv preprint arXiv:2407.01606, 2024

  80. [80]

    Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

    Wang, Y., Li, Z., Zang, Y., Zhou, Y., Bu, J., Wang, C., Lu, Q., Jin, C., and Wang, J. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning. arXiv preprint arXiv:2508.20751, 2025
