pith. machine review for the scientific record.

arxiv: 2605.09433 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: no theorem link

Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

Hengyuan Cao, Min Zhang, Qichao Wang, Xiaoyin Xu, Yunhong Lu

Pith reviewed 2026-05-12 02:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords preference optimization · rectified flow · text-to-image generation · offline alignment · noise tracking · DPO · trajectory estimation

The pith

Keeping the exact prior noise sample for each winner and loser image tightens the preference objective for rectified flow generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets a mismatch in preference alignment for rectified flow text-to-image models. Existing datasets record only final images, so prior methods reconstruct trajectories from fresh noise samples that do not match the path the model actually followed. By retaining the original noise paired with each image and interpolating along the straight line between noise and image, the approach builds a more faithful surrogate loss. A dynamic regularizer that scales with the reward gap and training progress further stabilizes training. The paper reports improved preference metrics at lower training cost on current rectified-flow backbones.

Core claim

Augmenting each preference triplet into a sextuple that includes the paired prior noises, then estimating intermediate states via linear noise-image interpolation under the straight-line property of rectified flow, produces a tighter surrogate for the preference optimization objective; an adaptive regularizer that depends on the winner-loser reward gap and training progress further improves stability and sample efficiency.
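To make the claim concrete, here is one way the interpolation and a DPO-style surrogate could be written. This is a sketch under stated assumptions, not the paper's own equations: the time convention (t = 0 is the image, t = 1 is the prior noise), the velocity parameterization v_θ, the frozen reference model v_ref, the prompt symbol c, and the fixed coefficient β are all assumptions carried over from the usual rectified flow and Diffusion-DPO formulations.

```latex
% Assumed convention: t = 0 is the image x_0, t = 1 is the stored prior noise \epsilon.
% The index * ranges over the winner (w) and the loser (l).
\begin{align}
x_t^{*} &= (1 - t)\, x_0^{*} + t\, \epsilon^{*},
  \qquad * \in \{w, l\},\; t \in [0, 1], \\
v^{*} &= \epsilon^{*} - x_0^{*}, \\
\Delta_\theta^{*} &= \lVert v_\theta(x_t^{*}, t, c) - v^{*} \rVert^2
  - \lVert v_{\mathrm{ref}}(x_t^{*}, t, c) - v^{*} \rVert^2, \\
\mathcal{L}(\theta) &= -\,\mathbb{E}_{t}\,
  \log \sigma\!\big( -\beta\, ( \Delta_\theta^{w} - \Delta_\theta^{l} ) \big).
\end{align}
```

Under this reading, the only change relative to independent noising is that ε^w and ε^l are the recorded noises that generated the two images rather than fresh draws, so x_t^* is intended to lie near the path the model actually took.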

What carries the argument

Prior Noise-Aware Preference Optimization (PNAPO), which augments data to (prompt, winner image, loser image, winner prior noise, loser prior noise) sextuples and substitutes noise-image linear interpolation for independent noising when constructing the DPO-style loss.
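A minimal sketch of the augmented record and the interpolation step, assuming NumPy arrays for images and noises. The class name `NoiseTrackedPair`, its field names, and the convention that t = 0 is the image and t = 1 is the noise are illustrative assumptions, not taken from the paper. The record holds only the five fields named above; the text calls the augmented record a sextuple, and no further field is assumed here.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class NoiseTrackedPair:
    """Hypothetical record layout; field names are illustrative only."""
    prompt: str
    winner_image: np.ndarray
    loser_image: np.ndarray
    winner_noise: np.ndarray  # prior noise that generated winner_image
    loser_noise: np.ndarray   # prior noise that generated loser_image


def interpolate(image: np.ndarray, noise: np.ndarray, t: float) -> np.ndarray:
    """Point on the straight segment from image (t=0) to noise (t=1)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return (1.0 - t) * image + t * noise


def intermediate_states(pair: NoiseTrackedPair, t: float):
    """Estimated intermediate states for winner and loser at the same t."""
    x_w = interpolate(pair.winner_image, pair.winner_noise, t)
    x_l = interpolate(pair.loser_image, pair.loser_noise, t)
    return x_w, x_l
```

The point of the design is that each image is paired with its own stored noise, so no random draw happens when the intermediate state is built.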

If this is right

  • Trajectory estimation variance drops because the interpolation is constrained to the model's actual generation path rather than an independent forward process.
  • The dynamic regularizer reduces the need for manual hyperparameter tuning by automatically weakening the penalty when the reward gap is large or training is early.
  • The method works entirely offline on existing preference datasets once the prior noises are added, without requiring online sampling or model changes at inference time.
  • Training compute decreases substantially because fewer gradient steps are needed to reach the same or better preference scores.
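The second bullet describes the regularizer only qualitatively, so the following is a purely hypothetical schedule that has the stated behavior: the coefficient shrinks as the reward gap grows and is smaller early in training. The function name `dynamic_beta`, the functional form, and the parameters `beta_min`, `beta_max`, and `gap_scale` are all assumptions; the paper's actual schedule is not given in this summary.

```python
def dynamic_beta(reward_gap: float, step: int, total_steps: int,
                 beta_min: float = 0.1, beta_max: float = 1.0,
                 gap_scale: float = 1.0) -> float:
    """Hypothetical regularization strength, not the paper's formula.

    Grows linearly with training progress from beta_min to beta_max and is
    divided by a factor that increases with the absolute reward gap.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps do not extrapolate.
    progress = min(max(step / total_steps, 0.0), 1.0)
    base = beta_min + (beta_max - beta_min) * progress
    return base / (1.0 + gap_scale * abs(reward_gap))
```

Any monotone form with the same two trends would illustrate the bullet equally well; the linear ramp and the reciprocal damping are chosen only for simplicity.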

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Preference data pipelines for flow-based generators should routinely record and store the initial noise sample alongside each image to enable this form of alignment.
  • The same noise-tracking idea could be tested on other straight-trajectory generative models to see whether the compute savings generalize beyond rectified flow.
  • If the straight-line assumption proves robust, future alignment work might shift from full-trajectory simulation to simple interpolation, lowering the barrier for smaller research groups.

Load-bearing premise

That storing the exact prior noise for every image in a preference dataset remains practical at scale and that straight-line interpolation between noise and image introduces no systematic bias into the preference signal.

What would settle it

A controlled experiment on the same preference pairs showing that replacing the recorded prior noises with freshly sampled independent noises produces equal or higher final preference metrics and equal or lower compute cost would falsify the claimed advantage.

Figures

Figures reproduced from arXiv: 2605.09433 by Hengyuan Cao, Min Zhang, Qichao Wang, Xiaoyin Xu, Yunhong Lu.

Figure 1: Our PNAPO achieves self-improvement by utilizing prior noise distributions and dynamically adjusting gradient updates.
Figure 2: Prior Noise Matters. Compared to FLUX, our PNAPO-FLUX generates images with superior text-image alignment, enhanced visual aesthetics and realism, particularly in resolving FLUX's characteristic background blurring issues. These advancements parallel how LLMs address hallucination, as both represent implicit optimizations of human preference alignment.
Figure 3: Comparison of PNAPO versus DPO baselines. Compared to Diffusion-DPO's stochastic noise injection, PNAPO employs a prior noise with x_T-x_0 interpolation for more accurate estimation, while surpassing D3PO in efficiency by avoiding iterative reverse processes. Additionally, dynamic regularization generation leverages δr reward gaps and training step n.
Figure 4: User Study and Qualitative Comparison. Top, human evaluations show PNAPO-FLUX significantly outperforming DPO-FLUX and the base FLUX model. Bottom, we present qualitative comparisons between PNAPO and Diffusion-DPO when applied to the FLUX and SD3-M. The results demonstrate that our model achieves superior image generation quality.
Figure 5: Qualitative ablation comparison of our proposed improvements.
Figure 6: Preference Dataset samples generated by FLUX. Both the quality improvements achieved through rectification and the effectiveness of HPSv2.1 in discriminating between sample variations.
Figure 7: Preference Dataset samples generated by FLUX. Both the quality improvements achieved through rectification and the effectiveness of HPSv2.1 in discriminating between sample variations.
Figure 8: Additional Qualitative results. Compared to FLUX base model, images aligned with our PNAPO demonstrate significant improvements in both text-image alignment and aesthetic quality, effectively validating the superiority of our approach.
Original abstract

Existing preference datasets for text-to-image models typically store only the final winner/loser images. This representation is insufficient for rectified flow (RF) models, whose generation is naturally indexed by a specific prior noise sample and follows a nearly straight denoising trajectory. In contrast, prior DPO-style alignment for diffusion models commonly estimates trajectories using an independent forward noising process, which can be mismatched to the true reverse dynamics and introduces unnecessary variance. We propose Prior Noise-Aware Preference Optimization (PNAPO), an off-policy alignment framework specialized for rectified flow. PNAPO augments preference data by retaining the paired prior noises used to generate each winner/loser image, turning the standard (prompt, winner, loser) triplet into a sextuple. Leveraging the straight-line property of RF, we estimate intermediate states via noise-image interpolation, which constrains the trajectory estimation space and yields a tighter surrogate objective for preference optimization. In addition, we introduce a dynamic regularization strategy that adapts the DPO regularization based on (i) the reward gap between winner and loser and (ii) training progress, improving stability and sample efficiency. Experiments on state-of-the-art RF T2I backbones show that PNAPO consistently improves preference metrics while substantially reducing training compute.
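As a numeric illustration of the DPO-style objective the abstract refers to, the sketch below evaluates the loss from per-sample squared errors of the trained model and a frozen reference model on the winner and loser states. The function name `preference_loss`, its argument names, and the use of plain squared errors without a timestep weighting are assumptions in the spirit of Diffusion-DPO, not the paper's exact loss.

```python
import numpy as np


def preference_loss(err_w, err_w_ref, err_l, err_l_ref, beta: float = 1.0) -> float:
    """DPO-style loss from squared prediction errors (illustrative only).

    err_w, err_l: errors of the trained model on winner and loser states.
    err_w_ref, err_l_ref: errors of the frozen reference model.
    """
    err_w, err_w_ref, err_l, err_l_ref = (
        np.asarray(a, dtype=float) for a in (err_w, err_w_ref, err_l, err_l_ref)
    )
    # Negative margin means the model improved more on the winner than the loser.
    margin = (err_w - err_w_ref) - (err_l - err_l_ref)
    # -log(sigmoid(-beta * margin)) == log(1 + exp(beta * margin)), computed stably.
    return float(np.mean(np.logaddexp(0.0, beta * margin)))
```

When the trained model matches the reference on both samples the margin is zero and the loss equals log 2; fitting the winner better than the reference lowers it, and fitting the loser better raises it.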

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Prior Noise-Aware Preference Optimization (PNAPO) for offline alignment of rectified flow (RF) text-to-image models. It augments standard preference triplets with the paired prior noise samples used to generate each winner and loser image, then leverages the (nearly) straight-line property of RF trajectories to interpolate intermediate states and form a tighter surrogate for the preference loss. A dynamic regularization term is added that adapts the DPO-style penalty using the observed reward gap and training progress. The authors claim that this yields consistent gains on preference metrics while substantially lowering training compute relative to prior diffusion-style DPO on state-of-the-art RF backbones.

Significance. If the interpolation is shown to be unbiased and the reported gains are supported by proper baselines and ablations, the work would offer a practical, RF-specific improvement to offline preference optimization. Retaining and reusing the original prior noises directly addresses a known mismatch between independent noising and the true reverse dynamics of RF models, which could translate into more sample-efficient alignment for a class of generators that is currently dominant in text-to-image research.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method): the central claim that noise-image interpolation produces a 'tighter surrogate' rests on the assumption that RF trajectories are sufficiently linear for the interpolated points to lie on the true denoising path. The abstract itself qualifies the trajectories as only 'nearly straight,' yet no error analysis, trajectory deviation measurements, or ablation comparing interpolated versus actual intermediate states is referenced. This assumption is load-bearing for both the correctness of the preference loss and the claimed compute reduction versus independent-noising DPO.
  2. [Experiments] Experiments section: the abstract asserts that PNAPO 'consistently improves preference metrics while substantially reducing training compute,' but the provided text supplies no quantitative numbers, baseline comparisons, or ablation results. Without these, it is impossible to judge effect sizes, the relative contribution of the interpolation versus the dynamic regularizer, or whether the compute savings are robust across model scales.
minor comments (2)
  1. [§3.1] The transition from the conventional (prompt, winner, loser) triplet to the proposed sextuple is described in prose but would benefit from an explicit equation or table defining the augmented data structure and the interpolation formula.
  2. [§3.3] Notation for the dynamic regularization schedule (reward-gap and progress terms) is introduced without a clear equation reference; adding a numbered equation would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify our contributions to offline preference optimization for rectified flow models. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the manuscript.

  1. Referee: [Abstract and §3] Abstract and §3 (method): the central claim that noise-image interpolation produces a 'tighter surrogate' rests on the assumption that RF trajectories are sufficiently linear for the interpolated points to lie on the true denoising path. The abstract itself qualifies the trajectories as only 'nearly straight,' yet no error analysis, trajectory deviation measurements, or ablation comparing interpolated versus actual intermediate states is referenced. This assumption is load-bearing for both the correctness of the preference loss and the claimed compute reduction versus independent-noising DPO.

    Authors: We appreciate the referee highlighting the importance of rigorously justifying the linearity assumption underlying our interpolation approach. Rectified flow trajectories are known to follow (nearly) straight paths in expectation, a property central to the RF formulation and leveraged in prior RF literature to enable efficient sampling. Our noise-tracked interpolation exploits this to constrain the trajectory space and reduce variance relative to independent noising. We agree, however, that an explicit error analysis would make the 'tighter surrogate' claim more robust. In the revised manuscript we will add a dedicated subsection (in §3 or the appendix) that quantifies trajectory deviation by comparing linearly interpolated states against actual intermediate outputs from the RF model on a held-out set of samples. We will also include an ablation that measures the effect of this approximation on the preference loss and overall alignment performance. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract asserts that PNAPO 'consistently improves preference metrics while substantially reducing training compute,' but the provided text supplies no quantitative numbers, baseline comparisons, or ablation results. Without these, it is impossible to judge effect sizes, the relative contribution of the interpolation versus the dynamic regularizer, or whether the compute savings are robust across model scales.

    Authors: We acknowledge that the experimental results need to be presented with greater clarity and detail to allow proper evaluation of effect sizes and component contributions. The full manuscript contains an experiments section (§4) reporting results on state-of-the-art RF text-to-image backbones, but we agree these should be expanded for accessibility. In the revision we will augment §4 with explicit quantitative tables showing preference metric improvements (e.g., win rates), direct baseline comparisons against diffusion-style DPO and other offline methods, ablations isolating the noise-aware interpolation and dynamic regularizer, and compute measurements (training FLOPs and wall-clock time) across multiple model scales. These additions will make the claimed gains and efficiency benefits fully transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external RF properties and observable quantities


The paper augments preference data with retained prior noises (an implementation choice) and derives the surrogate objective by interpolating along the line connecting noise to image, explicitly leveraging the known straight-line property of rectified flows as stated in the abstract. This interpolation is not self-definitional because the linearity assumption originates from prior RF literature rather than being fitted or defined within the present objective. The dynamic regularization term adapts explicitly to the observable reward gap between winner/loser and to training progress, both of which are external to the loss itself and not obtained by fitting the target quantity. No equation reduces a prediction to a fitted parameter by construction, no uniqueness theorem is imported via self-citation, and no ansatz is smuggled through prior work by the same authors. The central claim therefore remains independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that rectified flow trajectories are sufficiently straight to support linear interpolation between noise and image.

axioms (1)
  • domain assumption: Rectified flow models follow nearly straight denoising trajectories from prior noise to image.
    Invoked to justify estimating intermediate states via noise-image interpolation.
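One way to probe this axiom empirically is to compare actual intermediate sampler states with the straight-line interpolation between the trajectory's endpoints. The sketch below assumes a hypothetical array `trajectory` holding the states saved during sampling at uniformly spaced times, ordered from noise to image; the function name and the relative-deviation metric are illustrative choices, not a measurement defined in the paper.

```python
import numpy as np


def straightness_deviation(trajectory: np.ndarray) -> float:
    """Largest gap between actual states and the chord, relative to chord length.

    trajectory: array of shape (K + 1, ...) with states at uniformly spaced
    times, index 0 being the prior noise and index K the final image
    (an assumed layout).
    """
    traj = trajectory.reshape(trajectory.shape[0], -1)
    start, end = traj[0], traj[-1]
    chord = np.linalg.norm(end - start)
    if chord == 0.0:
        raise ValueError("endpoints coincide; deviation is undefined")
    # Linear interpolation between the endpoints at the same time indices.
    s = np.linspace(0.0, 1.0, traj.shape[0])[:, None]
    line = (1.0 - s) * start + s * end
    return float(np.max(np.linalg.norm(traj - line, axis=1)) / chord)
```

A value near zero means the interpolated states coincide with what the sampler actually produced; larger values indicate curvature that linear interpolation would miss.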



Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 12 internal anchors

  1. [1]

    Training Diffusion Models with Reinforcement Learning

    Black, K., Janner, M., Du, Y ., Kostrikov, I., and Levine, S. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301,

  2. [2]

    Preference-based alignment of discrete diffusion mod- els.arXiv preprint arXiv:2503.08295,

    Borso, U., Paglieri, D., Wells, J., and Rockt ¨aschel, T. Preference-based alignment of discrete diffusion mod- els.arXiv preprint arXiv:2503.08295,

  3. [3]

    Dimension-reduction attack! video generative models are experts on controllable image synthesis.arXiv preprint arXiv:2505.23325, 2025a

    Cao, H., Feng, Y ., Gong, B., Tian, Y ., Lu, Y ., Liu, C., and Wang, B. Dimension-reduction attack! video generative models are experts on controllable image synthesis.arXiv preprint arXiv:2505.23325, 2025a. Cao, H., Lu, Y ., Wang, Q., Li, T., Xu, X., and Zhang, M. Adversarial self flow matching: Few-steps image generation with straight flows, 2025b. URL ...

  4. [4]

    Clark, K., Vicol, P., Swersky, K., and Fleet, D. J. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400,

  5. [5]

    T., Sebe, N., and Shah, M

    Croitoru, F.-A., Hondru, V ., Ionescu, R. T., Sebe, N., and Shah, M. Curriculum direct preference optimization for diffusion and consistency models.arXiv preprint arXiv:2405.13637,

  6. [6]

    Personalized preference fine-tuning of diffusion models

    13 Submission and Formatting Instructions for ICML 2026 Dang, M., Singh, A., Zhou, L., Ermon, S., and Song, J. Personalized preference fine-tuning of diffusion models. arXiv preprint arXiv:2501.06655,

  7. [7]

    Raft: Reward ranked finetuning for generative foundation model alignment.arXiv preprint arXiv:2304.06767, 2023

    Dong, H., Xiong, W., Goyal, D., Zhang, Y ., Chow, W., Pan, R., Diao, S., Zhang, J., Shum, K., and Zhang, T. Raft: Re- ward ranked finetuning for generative foundation model alignment.arXiv preprint arXiv:2304.06767,

  8. [8]

    Per- sonalized image editing in text-to-image diffusion models via collaborative direct preference optimization.arXiv preprint arXiv:2511.05616,

    Dunlop, C., Zheng, M., Venkatesh, K., and Yanardag, P. Per- sonalized image editing in text-to-image diffusion models via collaborative direct preference optimization.arXiv preprint arXiv:2511.05616,

  9. [9]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Muller, J., Saini, H., Levi, Y ., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Good- win, A., Marek, Y ., and Rombach, R. Scaling rectified flow transformers for high-resolution image synthesis. ArXiv, abs/2403.03206,

  10. [10]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306,

  11. [11]

    Chats: Combining human-aligned optimization and test-time sampling for text-to-image generation.arXiv preprint arXiv:2502.12579,

    Fu, M., Wang, G.-H., Cao, L., Chen, Q.-G., Xu, Z., Luo, W., and Zhang, K. Chats: Combining human-aligned optimization and test-time sampling for text-to-image generation.arXiv preprint arXiv:2502.12579,

  12. [12]

    Gadre, S. Y ., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., Orgad, E., Entezari, R., Daras, G., Pratt, S., Ramanujan, V ., Bitton, Y ., Marathe, K., Mussmann, S., Vencu, R., Cherti, M., Krishna, R., Koh, P. W., Saukh, O., Ratner, A. J., Song, S., Hajishirzi, H., Farhadi, A., Beaumont, R., Oh, ...

  13. [13]

    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 11170–11189, 2024

    Hong, J., Lee, N., and Thorne, J. Orpo: Monolithic pref- erence optimization without reference model.arXiv preprint arXiv:2403.07691, 2024a. Hong, J., Paul, S., Lee, N., Rasul, K., Thorne, J., and Jeong, J. Margin-aware preference optimization for aligning diffusion models without reference. InFirst Workshop on Scalable Optimization for Efficient and Adap...

  14. [14]

    Towards better alignment: Training diffusion models with reinforcement learning against sparse rewards.arXiv preprint arXiv:2503.11240, 2025a

    Hu, Z., Zhang, F., Chen, L., Kuang, K., Li, J., Gao, K., Xiao, J., Wang, X., and Zhu, W. Towards better alignment: Training diffusion models with reinforcement learning against sparse rewards.arXiv preprint arXiv:2503.11240, 2025a. Hu, Z., Zhang, F., and Kuang, K. D-fusion: Direct prefer- ence optimization for aligning diffusion models with visu- ally con...

  15. [15]

    Patchdpo: Patch-level dpo for finetuning- free personalized image generation.arXiv preprint arXiv:2412.03177,

    Huang, Q., Chan, L., Liu, J., He, W., Jiang, H., Song, M., and Song, J. Patchdpo: Patch-level dpo for finetuning- free personalized image generation.arXiv preprint arXiv:2412.03177,

  16. [16]

    Open image prefer- ences v1

    14 Submission and Formatting Instructions for ICML 2026 is Better-Together, D. Open image prefer- ences v1. https://huggingface.co/ datasets/data-is-better-together/ open-image-preferences-v1,

  17. [17]

    Diffusion tree sampling: Scalable inference-time alignment of diffusion models, 2025

    Jain, V ., Sareen, K., Pedramfar, M., and Ravanbakhsh, S. Diffusion tree sampling: Scalable inference- time alignment of diffusion models.arXiv preprint arXiv:2506.20701,

  18. [18]

    Elucidating the Design Space of Diffusion-Based Generative Models

    Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. ArXiv, abs/2206.00364,

  19. [19]

    Scalable ranked preference optimization for text- to-image generation.arXiv preprint arXiv:2410.18013,

    Karthik, S., Coskun, H., Akata, Z., Tulyakov, S., Ren, J., and Kag, A. Scalable ranked preference optimization for text- to-image generation.arXiv preprint arXiv:2410.18013,

  20. [20]

    Lee, J.-Y ., Cha, B., Kim, J., and Ye, J. C. Aligning text to image in diffusion models is easier than you think.arXiv preprint arXiv:2503.08250, 2025a. Lee, K., Liu, H., Ryu, M., Watkins, O., Du, Y ., Boutilier, C., Abbeel, P., Ghavamzadeh, M., and Gu, S. S. Aligning text- to-image models using human feedback.arXiv preprint arXiv:2302.12192,

  21. [21]

    Calibrated multi- preference optimization for aligning diffusion models

    Lee, K., Li, X., Wang, Q., He, J., Ke, J., Yang, M.-H., Essa, I., Shin, J., Yang, F., and Li, Y . Calibrated multi- preference optimization for aligning diffusion models. arXiv preprint arXiv:2502.02588, 2025b. Li, J., Cui, Y ., Huang, T., Ma, Y ., Fan, C., Yang, M., and Zhong, Z. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde.arXiv prep...

  22. [22]

    Aligning diffusion models by optimizing human utility

    Li, S., Kallidromitis, K., Gokul, A., Kato, Y ., and Kozuka, K. Aligning diffusion models by optimizing human utility. arXiv preprint arXiv:2404.04465,

  23. [23]

    Step-aware preference optimization: Aligning preference with denoising performance at each step.arXiv preprint arXiv:2406.04314, 2024

    Liang, Y ., He, J., Li, G., Li, P., Klimovskiy, A., Carolan, N., Sun, J., Pont-Tuset, J., Young, S., Yang, F., et al. Rich human feedback for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19401–19411, 2024a. Liang, Z., Yuan, Y ., Gu, S., Chen, B., Hang, T., Li, J., and Zheng, L. Step-aw...

  24. [24]

    Towards un- derstanding camera motions in any video.arXiv preprint arXiv:2504.15376,

    Lin, Z., Cen, S., Jiang, D., Karhade, J., Wang, H., Mitra, C., Ling, T., Huang, Y ., Liu, S., Chen, M., et al. Towards un- derstanding camera motions in any video.arXiv preprint arXiv:2504.15376,

  25. [25]

    Lin, Z., Mitra, C., Cen, S., Li, I., Huang, Y ., Ling, Y . T. T., Wang, H., Pi, I., Zhu, S., Rao, R., et al. Building a precise video language with human-ai oversight.arXiv preprint arXiv:2604.21718,

  26. [26]

    Flow Matching for Generative Modeling

    Lipman, Y ., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

  27. [27]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Liu, J., Liu, G., Liang, J., Li, Y ., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025a. Liu, J., Liu, G., Liang, J., Yuan, Z., Liu, X., Zheng, M., Wu, X., Wang, Q., Qin, W., Xia, M., et al. Improving video generation with human feedback.arXiv preprint arXiv:...

  28. [28]

    Discovery of the reward function for embodied reinforcement learning agents.Nature Communications, 16(1):11064, 2025a

    Lu, R., Shao, Z., Ding, Y ., Chen, R., Wu, D., Su, H., Yang, T., Zhang, F., Wang, J., Shi, Y ., et al. Discovery of the reward function for embodied reinforcement learning agents.Nature Communications, 16(1):11064, 2025a. Lu, Y ., Wang, Q., Cao, H., Wang, X., Xu, X., and Zhang, M. Inpo: Inversion preference optimization with reparametrized ddim for effici...

  29. [29]

    Step-video-t2v tech- nical report: The practice, challenges, and future of video foundation model

    Ma, G., Huang, H., Yan, K., Chen, L., Duan, N., Yin, S., Wan, C., Ming, R., Song, X., Chen, X., et al. Step- video-t2v technical report: The practice, challenges, and future of video foundation model.arXiv preprint arXiv:2502.10248,

  30. [30]

    Tuning timestep-distilled diffusion model using pairwise sample optimization.arXiv preprint arXiv:2410.03190,

    Miao, Z., Yang, Z., Lin, K., Wang, Z., Liu, Z., Wang, L., and Qiu, Q. Tuning timestep-distilled diffusion model using pairwise sample optimization.arXiv preprint arXiv:2410.03190,

  31. [31]

    J., Kang, W., and Moon, I.-C

    Na, B., Park, M., Sim, G., Shin, D., Bae, H., Kang, M., Kwon, S. J., Kang, W., and Moon, I.-C. Diffusion adap- tive text embedding for text-to-image diffusion models. arXiv preprint arXiv:2510.23974,

  32. [32]

    Boost your human image gen- eration model via direct preference optimization.arXiv preprint arXiv:2405.20216,

    Na, S., Kim, Y ., and Lee, H. Boost your human image gen- eration model via direct preference optimization.arXiv preprint arXiv:2405.20216,

  33. [33]

    Peebles, W. S. and Xie, S. Scalable diffusion models with transformers.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4172–4182,

  34. [34]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Muller, J., Penna, J., and Rombach, R. Sdxl: Im- proving latent diffusion models for high-resolution image synthesis.ArXiv, abs/2307.01952,

  35. [35]

    Re- fining alignment framework for diffusion models with intermediate-step preference ranking.arXiv preprint arXiv:2502.01667,

    Ren, J., Zhang, Y ., Liu, D., Zhang, X., and Tian, Q. Re- fining alignment framework for diffusion models with intermediate-step preference ranking.arXiv preprint arXiv:2502.01667,

  36. [36]

    S., Shu, Z., Zhang, J., Jung, H., Gerig, G., and Zhang, H

    Ren, M., Xiong, W., Yoon, J. S., Shu, Z., Zhang, J., Jung, H., Gerig, G., and Zhang, H. Relightful harmonization: Lighting-aware portrait background replacement.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6452–6462,

  37. [37]

    High-resolution image synthesis with la- tent diffusion models.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with la- tent diffusion models.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10674–10685,

  38. [38]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  39. [39]

    Prioritize denois- ing steps on diffusion model preference alignment via explicit denoised distribution estimation.arXiv preprint arXiv:2411.14871,

    Shi, D., Wang, Y ., Li, H., and Chu, X. Prioritize denois- ing steps on diffusion model preference alignment via explicit denoised distribution estimation.arXiv preprint arXiv:2411.14871,

  40. [40]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502,

  41. [41]

    16 Submission and Formatting Instructions for ICML 2026 Tee, J. T. J., Yoon, H. S., Syarubany, A. H. M., Yoon, E., and Yoo, C. D. A gradient guidance perspective on stepwise preference optimization for diffusion models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems. Wallace, B., Dang, M., Rafailov, R., Zhou, L., Lou, A., Pu...

  42. [42]

    Diffusion- npo: Negative preference optimization for better pref- erence aligned generation of diffusion models.arXiv preprint arXiv:2505.11245, 2025a

    Wang, F.-Y ., Shui, Y ., Piao, J., Sun, K., and Li, H. Diffusion- npo: Negative preference optimization for better pref- erence aligned generation of diffusion models.arXiv preprint arXiv:2505.11245, 2025a. Wang, F.-Y ., Sun, K., Teng, Y ., Liu, X., Song, J., and Li, H. Self-npo: Negative preference optimization of diffusion models by simply learning from...

  43. [43]

    URL https:// arxiv.org/abs/2605.03877. Wang, Z. J., Montoya, E., Munechika, D., Yang, H., Hoover, B., and Chau, D. H. Diffusiondb: A large-scale prompt gallery dataset for text-to-image generative models.arXiv preprint arXiv:2210.14896,

  44. [44]

    Human preference score: Better aligning text-to-image models with human preference

    Wu, X., Sun, K., Zhu, F., Zhao, R., and Li, H. Human preference score: Better aligning text-to-image models with human preference. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2096– 2105,

  45. [45]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Xue, Z., Wu, J., Gao, Y ., Kong, F., Zhu, L., Chen, M., Liu, Z., Liu, W., Guo, Q., Huang, W., et al. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818,

  46. [46]

    Using human feedback to fine-tune diffusion models without any reward model

    Yang, K., Tao, J., Lyu, J., Ge, C., Chen, J., Shen, W., Zhu, X., and Li, X. Using human feedback to fine-tune diffusion models without any reward model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8941–8951, 2024a. Yang, S., Chen, T., and Zhou, M. A dense reward view on aligning text-to-image diffusion with pre...

  47. [47]

    Rrhf: Rank responses to align language models with human feedback without tears

    Yuan, Z., Yuan, H., Tan, C., Wang, W., Huang, S., and Huang, F. Rrhf: Rank responses to align language mod- els with human feedback without tears.arXiv preprint arXiv:2304.05302,

  48. [48]

    Learning multi-dimensional human preference for text-to-image generation

    Zhang, S., Wang, B., Wu, J., Li, Y., Gao, T., Zhang, D., and Wang, Z. Learning multi-dimensional human preference for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8018–8027, 2024a. Zhang, T., Da, C., Ding, K., Yang, H., Jin, K., Li, Y., Gao, T., Zhang, D., Xiang, S., and Pan, C. Dif...

  49. [49]

    Large-scale reinforcement learning for diffusion models

    Zhang, Y., Tzeng, E., Du, Y., and Kislyuk, D. Large-scale reinforcement learning for diffusion models. In European Conference on Computer Vision, pp. 1–17. Springer, 2024b. Zheng, K., Chen, Y., Chen, H., He, G., Liu, M.-Y., Zhu, J., and Zhang, Q. Direct discriminative optimization: Your likelihood-based visual generative model is secretly a gan discr...

  50. [50]

    Background A.1

    17 Submission and Formatting Instructions for ICML 2026. A. Background. A.1. More Related Works. Conditional Generative Models. Diffusion models belong to a family of generative approaches that create data through an iterative denoising process. These models learn to reverse a predefined forward process that gradually adds noise to data. By capitalizing on ne...

  51. [51]

    However, PPO’s computational intensity and complex optimization landscape frequently pose implementation challenges

    as its core algorithm, which necessitates simultaneous operation of multiple model components - including the active policy, reference model, value function estimator, and reward predictor. However, PPO’s computational intensity and complex optimization landscape frequently pose implementation challenges. To mitigate these issues, researchers have develop...

  52. [52]

    These advancements collectively represent significant progress in developing more efficient and theoretically grounded alignment techniques

    enhances LLM training through comparative reward information. These advancements collectively represent significant progress in developing more efficient and theoretically grounded alignment techniques. Further Preference Optimization of Diffusion Models. The application of preference alignment techniques extends well beyond text-to-image diffusion models ...

  53. [53]

    Promising future directions (Cao et al., 2025a;b; Lin et al., 2026; 2025; Lu et al., 2025a; Wang et al.,

    remains a nascent research area. Promising future directions (Cao et al., 2025a;b; Lin et al., 2026; 2025; Lu et al., 2025a; Wang et al.,

  54. [54]


    may involve transferring alignment methodologies from large language models to generative visual systems (Borso et al., 2025; Lee et al., 2025a; Zheng et al., 2025; Lu et al., 2025d), as well as expanding these techniques to novel sensory modalities including auditory and haptic domains (Huang et al., 2024; Shi et al., 2024). ...

  55. [55]

    Details of the Primary Derivation. In this section, we present a detailed derivation of our proposed method

    C. Details of the Primary Derivation. In this section, we present a detailed derivation of our proposed method. Following Diffusion-DPO, we define the reward on the whole chain: $r(x_0, c) = \mathbb{E}_{p_\theta(x_{1:T} \mid x_0, c)}\left[ r(x_{0:T}, c) \right]$. (25) We begin with the objective function of RLHF: $\max_{p_\theta} \mathbb{E}_{x_0 \sim p_\theta(x_0 \mid c)}\left[ r(x_0, c) \right] / \beta - D_{\mathrm{KL}}$...

  56. [56]

    $\mathbb{E}_{x^w_{1:T-1} \sim p^c_\theta(x^w_{1:T-1} \mid x^w_0, x^w_T),\; x^l_{1:T-1} \sim p^c_\theta(x^l_{1:T-1} \mid x^l_0, x^l_T)} \left[ \log \frac{p^c_\theta(x^w_{0:T})}{p^c_{\mathrm{ref}}(x^w_{0:T})} - \log \frac{p^c_\theta(x^l_{0:T})}{p^c_{\mathrm{ref}}(x^l_{0:T})} \right]$. (31) Given $x^*_T$, $p^c_\theta(x^*_{1:T-1} \mid x^*_0, x^*_T)$ becomes tractable if we estimate it using $p^c_\theta(x^*_{1:T-1} \mid x^*_T)$, though this approach is evidently resource-intensive. Leveraging the inherent straightness o...
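The straight-line estimate of intermediate states mentioned above can be illustrated with a minimal NumPy sketch. It assumes the convention that $t = 0$ is the clean image $x_0$ and $t = 1$ (the normalized final step $T$) is the tracked prior noise $x_T$. The function name `interpolate_state` and the array shapes are hypothetical and not taken from the paper.

```python
import numpy as np

def interpolate_state(x0, xT, t):
    """Estimate x_t on the straight line from image x0 (t=0) to noise xT (t=1)."""
    # Reshape t to (batch, 1, 1, ...) so it broadcasts over the sample dimensions.
    t = np.asarray(t, dtype=x0.dtype).reshape(-1, *([1] * (x0.ndim - 1)))
    return (1.0 - t) * x0 + t * xT

rng = np.random.default_rng(0)
x0_w = rng.standard_normal((2, 4, 8, 8))  # stand-in for winner latents
xT_w = rng.standard_normal((2, 4, 8, 8))  # the prior noise stored with each winner
t = np.array([0.25, 0.75])

xt_w = interpolate_state(x0_w, xT_w, t)   # estimated intermediate states
```

The point of the sketch is that `xT_w` is the noise recorded at generation time rather than a fresh draw, so the interpolated state lies on the path associated with that particular image.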

  57. [57]

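    $\mathbb{E}_{x^w_{t-1,t} \sim q(x^w_{t-1,t} \mid x^w_0, x^w_T),\; x^l_{t-1,t} \sim q(x^l_{t-1,t} \mid x^l_0, x^l_T)} \left[ \log \frac{p^c_\theta(x^w_{t-1} \mid x^w_t)}{p^c_{\mathrm{ref}}(x^w_{t-1} \mid x^w_t)} - \log \frac{p^c_\theta(x^l_{t-1} \mid x^l_t)}{p^c_{\mathrm{ref}}(x^l_{t-1} \mid x^l_t)} \right]$ $= -\mathbb{E}_{\mathcal{D}} \log \sigma ( \beta T \, \mathbb{E}_t \, \mathbb{E}_{x^w_T \sim p_\theta(x^w_T \mid x^w_0),\; x^l_T \sim p_\theta(x^l_T \mid x^l_0)} \, \mathbb{E}_{x^w_t \sim q(x^w_t \mid x^w_0, x^w_T),\; x^l_t \sim q(x^l_t \mid x^l_0, x^l_T)} \, \mathbb{E}_{x^w_{t-1} \sim q(x^w_{t-1} \mid x^w_0, x^w_t, x^w_T),\; x^l_{t-1} \sim q(x^l_{t-1} \mid x^l_0, x^l_t, x^l_T)} \log p^c_\theta(x^w_{t-1} \mid x$...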
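The $-\log \sigma(\cdot)$ structure of this objective can be sketched numerically. The snippet below is an illustration under stated assumptions, not the paper's loss: following the usual Diffusion-DPO simplification, each log-ratio term is replaced by the gap between the reference model's and the trained model's squared prediction error, and `beta_T` lumps together $\beta T$ and any timestep weighting. All function and argument names are hypothetical.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def sq_err(pred, target):
    # Per-sample squared error, summed over all non-batch dimensions.
    return ((pred - target) ** 2).reshape(pred.shape[0], -1).sum(axis=1)

def preference_loss(v_theta_w, v_ref_w, v_theta_l, v_ref_l,
                    target_w, target_l, beta_T):
    # Assumed surrogate for each log-ratio: reference error minus model error.
    ratio_w = sq_err(v_ref_w, target_w) - sq_err(v_theta_w, target_w)
    ratio_l = sq_err(v_ref_l, target_l) - sq_err(v_theta_l, target_l)
    return float(np.mean(-log_sigmoid(beta_T * (ratio_w - ratio_l))))

rng = np.random.default_rng(0)
shape = (3, 4, 8, 8)
target_w, target_l = rng.standard_normal(shape), rng.standard_normal(shape)
v_ref_w, v_ref_l = rng.standard_normal(shape), rng.standard_normal(shape)

# Model identical to the reference: both ratios vanish and the loss is log 2.
loss_same = preference_loss(v_ref_w, v_ref_w, v_ref_l, v_ref_l,
                            target_w, target_l, beta_T=0.1)
# Model perfect on the winner only: the loss drops below log 2.
loss_better = preference_loss(target_w, v_ref_w, v_ref_l, v_ref_l,
                              target_w, target_l, beta_T=0.1)
```

With noise-tracked pairs, `target_w` and `target_l` would be built from the stored noise of each image (for a rectified flow, typically the velocity between image and noise), which is where the tracked noise enters the loss.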

  58. [58]

    E.2. Off-Policy Data Construction. We present a subset of samples generated by FLUX in Figure 6 and Figure 7, with comparative assessments conducted using HPSv2.1 (Human Preference Score v2.1) as the evaluation metric. These visualizations demonstrate both the quality improvements achieved through rectification and the effectiveness of HPSv2.1 in discrimin...
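A noise-tracked preference record of the kind described here could be assembled roughly as follows. This is a sketch under assumptions: `generate` and `score` are hypothetical stand-ins for the FLUX sampler and the HPSv2.1 scorer (their real interfaces are not shown in this section), and choosing the best and worst of several candidates is one possible pairing rule, not necessarily the paper's.

```python
import numpy as np

def build_pair(prompt, generate, score, rng, shape=(4, 8, 8), num_candidates=4):
    """Sample candidates from known noises and keep the noise with each image."""
    noises = [rng.standard_normal(shape) for _ in range(num_candidates)]
    images = [generate(prompt, noise) for noise in noises]
    scores = np.array([score(prompt, image) for image in images])
    w, l = int(scores.argmax()), int(scores.argmin())  # assumed pairing rule
    return {
        "prompt": prompt,
        "x0_w": images[w], "x0_l": images[l],   # winner / loser images
        "xT_w": noises[w], "xT_l": noises[l],   # the noises that produced them
        "reward_gap": float(scores[w] - scores[l]),
    }

# Toy stand-ins so the sketch runs without any model (hypothetical).
def toy_generate(prompt, noise):
    return np.tanh(noise)

def toy_score(prompt, image):
    return float(image.mean())

pair = build_pair("Floral wallpaper with orange pastel colors.",
                  toy_generate, toy_score, np.random.default_rng(0))
```

Storing `reward_gap` alongside the pair is optional, but it makes the winner-loser score difference available later, for example to a regularizer that depends on it.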

  59. [59]

    Fly an airplane

    Mobile photo of a 747 plane crashing through traffic on the highway. Vintage lithograph of barack obama doing a fortnite dance. Vending machine for crack cocaine cigarette, soviet propaganda style. Floral wallpaper with orange pastel colors. Figure 6. Preference dataset samples generated by FLUX. Both the quality improvements achieved through rectification...