Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
Pith reviewed 2026-05-12 02:05 UTC · model grok-4.3
The pith
Keeping the exact prior noise sample for each winner and loser image tightens the preference objective for rectified flow generators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Augmenting each preference triplet into a sextuple that includes the paired prior noises, then estimating intermediate states via linear noise-image interpolation under the straight-line property of rectified flow, produces a tighter surrogate for the preference optimization objective; an adaptive regularizer that depends on the winner-loser reward gap and training progress further improves stability and sample efficiency.
What carries the argument
Prior Noise-Aware Preference Optimization (PNAPO), which augments data to (prompt, winner image, loser image, winner prior noise, loser prior noise) sextuples and substitutes noise-image linear interpolation for independent noising when constructing the DPO-style loss.
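To make the data augmentation and the interpolated loss concrete, a minimal PyTorch-style sketch follows. The `model(x_t, t, prompt_emb)` calling convention, the velocity-error form of the implicit reward, and all names are assumptions borrowed from flow-matching practice and Diffusion-DPO, not the paper's verbatim objective.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of one noise-tracked preference step. The sextuple layout and
# the DPO-style loss form are inferred from the abstract and from Diffusion-DPO;
# the paper's exact objective and weighting may differ.
def pnapo_style_loss(model, ref_model, prompt_emb,
                     img_w, img_l, noise_w, noise_l, beta=1.0):
    b = img_w.shape[0]
    t = torch.rand(b, device=img_w.device).view(b, 1, 1, 1)  # one random time per sample

    # Straight-line (rectified flow) interpolation between the *recorded* prior
    # noise and the image, instead of independently re-noising the image:
    #   x_t = (1 - t) * noise + t * image,   target velocity u = image - noise
    xt_w, u_w = (1 - t) * noise_w + t * img_w, img_w - noise_w
    xt_l, u_l = (1 - t) * noise_l + t * img_l, img_l - noise_l

    def vel_err(net, xt, u):
        # per-sample squared velocity-prediction error of a flow model
        return ((net(xt, t.flatten(), prompt_emb) - u) ** 2).flatten(1).mean(1)

    with torch.no_grad():
        ref_w, ref_l = vel_err(ref_model, xt_w, u_w), vel_err(ref_model, xt_l, u_l)

    # Diffusion-DPO-style preference logit: the policy should beat the frozen
    # reference by more on the winner than on the loser (lower error = higher
    # implied reward), hence the leading minus sign.
    logits = -beta * ((vel_err(model, xt_w, u_w) - ref_w)
                      - (vel_err(model, xt_l, u_l) - ref_l))
    return -F.logsigmoid(logits).mean()
```

Freshly sampling `noise_w` and `noise_l` inside this function, rather than reading the recorded values, would recover the independent-noising baseline the abstract argues against.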
If this is right
- Trajectory estimation variance drops because the interpolation is constrained to the model's actual generation path rather than an independent forward process.
- The dynamic regularizer reduces the need for manual hyperparameter tuning by automatically weakening the penalty when the reward gap is large or training is early (see the sketch after this list).
- The method works entirely offline on existing preference datasets once the prior noises are added, without requiring online sampling or model changes at inference time.
- Training compute decreases substantially because fewer gradient steps are needed to reach the same or better preference scores.
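A sketch of how the adaptive regularizer could be realized, assuming only the qualitative behavior stated above (weaker penalty early in training and when the reward gap is large); the functional form and every parameter name are guesses, not the paper's schedule:

```python
def adaptive_beta(beta_base, reward_gap, step, total_steps,
                  gap_scale=1.0, min_frac=0.1):
    # Penalty strength grows with training progress and shrinks as the observed
    # winner-loser reward gap grows; beta_base is the usual fixed DPO beta.
    progress = step / max(total_steps, 1)                         # 0 -> 1 over training
    gap_damping = 1.0 / (1.0 + gap_scale * max(reward_gap, 0.0))  # large gap -> weaker penalty
    return beta_base * (min_frac + (1.0 - min_frac) * progress * gap_damping)
```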
Where Pith is reading between the lines
- Preference data pipelines for flow-based generators should routinely record and store the initial noise sample alongside each image to enable this form of alignment.
- The same noise-tracking idea could be tested on other straight-trajectory generative models to see whether the compute savings generalize beyond rectified flow.
- If the straight-line assumption proves robust, future alignment work might shift from full-trajectory simulation to simple interpolation, lowering the barrier for smaller research groups.
Load-bearing premise
That storing the exact prior noise for every image in a preference dataset remains practical at scale and that straight-line interpolation between noise and image introduces no systematic bias into the preference signal.
What would settle it
A controlled experiment on the same preference pairs showing that replacing the recorded prior noises with freshly sampled independent noises produces equal or higher final preference metrics and equal or lower compute cost would falsify the claimed advantage.
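The control condition is easy to express in code: identical pairs, hyperparameters, and step budget, with only the noise source swapped, then compare final preference metrics and compute. A sketch (the batch field names are assumptions, not the paper's schema):

```python
import torch

def build_endpoints(batch, use_recorded_noise):
    # Swap between the PNAPO condition (recorded prior noise) and the
    # independent-noising control while keeping everything else fixed.
    if use_recorded_noise:
        noise_w, noise_l = batch["noise_w"], batch["noise_l"]
    else:
        noise_w = torch.randn_like(batch["img_w"])
        noise_l = torch.randn_like(batch["img_l"])
    return batch["img_w"], batch["img_l"], noise_w, noise_l
```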
Original abstract
Existing preference datasets for text-to-image models typically store only the final winner/loser images. This representation is insufficient for rectified flow (RF) models, whose generation is naturally indexed by a specific prior noise sample and follows a nearly straight denoising trajectory. In contrast, prior DPO-style alignment for diffusion models commonly estimates trajectories using an independent forward noising process, which can be mismatched to the true reverse dynamics and introduces unnecessary variance. We propose Prior Noise-Aware Preference Optimization (PNAPO), an off-policy alignment framework specialized for rectified flow. PNAPO augments preference data by retaining the paired prior noises used to generate each winner/loser image, turning the standard (prompt, winner, loser) triplet into a sextuple. Leveraging the straight-line property of RF, we estimate intermediate states via noise-image interpolation, which constrains the trajectory estimation space and yields a tighter surrogate objective for preference optimization. In addition, we introduce a dynamic regularization strategy that adapts the DPO regularization based on (i) the reward gap between winner and loser and (ii) training progress, improving stability and sample efficiency. Experiments on state-of-the-art RF T2I backbones show that PNAPO consistently improves preference metrics while substantially reducing training compute.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Prior Noise-Aware Preference Optimization (PNAPO) for offline alignment of rectified flow (RF) text-to-image models. It augments standard preference triplets with the paired prior noise samples used to generate each winner and loser image, then leverages the (nearly) straight-line property of RF trajectories to interpolate intermediate states and form a tighter surrogate for the preference loss. A dynamic regularization term is added that adapts the DPO-style penalty using the observed reward gap and training progress. The authors claim that this yields consistent gains on preference metrics while substantially lowering training compute relative to prior diffusion-style DPO on state-of-the-art RF backbones.
Significance. If the interpolation is shown to be unbiased and the reported gains are supported by proper baselines and ablations, the work would offer a practical, RF-specific improvement to offline preference optimization. Retaining and reusing the original prior noises directly addresses a known mismatch between independent noising and the true reverse dynamics of RF models, which could translate into more sample-efficient alignment for a class of generators that is currently dominant in text-to-image research.
major comments (2)
- [Abstract and §3] Abstract and §3 (method): the central claim that noise-image interpolation produces a 'tighter surrogate' rests on the assumption that RF trajectories are sufficiently linear for the interpolated points to lie on the true denoising path. The abstract itself qualifies the trajectories as only 'nearly straight,' yet no error analysis, trajectory deviation measurements, or ablation comparing interpolated versus actual intermediate states is referenced. This assumption is load-bearing for both the correctness of the preference loss and the claimed compute reduction versus independent-noising DPO.
- [Experiments] Experiments section: the abstract asserts that PNAPO 'consistently improves preference metrics while substantially reducing training compute,' but the provided text supplies no quantitative numbers, baseline comparisons, or ablation results. Without these, it is impossible to judge effect sizes, the relative contribution of the interpolation versus the dynamic regularizer, or whether the compute savings are robust across model scales.
minor comments (2)
- [§3.1] The transition from the conventional (prompt, winner, loser) triplet to the proposed sextuple is described in prose but would benefit from an explicit equation or table defining the augmented data structure and the interpolation formula.
- [§3.3] Notation for the dynamic regularization schedule (reward-gap and progress terms) is introduced without a clear equation reference; adding a numbered equation would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify our contributions to offline preference optimization for rectified flow models. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (method): the central claim that noise-image interpolation produces a 'tighter surrogate' rests on the assumption that RF trajectories are sufficiently linear for the interpolated points to lie on the true denoising path. The abstract itself qualifies the trajectories as only 'nearly straight,' yet no error analysis, trajectory deviation measurements, or ablation comparing interpolated versus actual intermediate states is referenced. This assumption is load-bearing for both the correctness of the preference loss and the claimed compute reduction versus independent-noising DPO.
Authors: We appreciate the referee highlighting the importance of rigorously justifying the linearity assumption underlying our interpolation approach. Rectified flow trajectories are known to follow (nearly) straight paths in expectation, a property central to the RF formulation and leveraged in prior RF literature to enable efficient sampling. Our noise-tracked interpolation exploits this to constrain the trajectory space and reduce variance relative to independent noising. We agree, however, that an explicit error analysis would make the 'tighter surrogate' claim more robust. In the revised manuscript we will add a dedicated subsection (in §3 or the appendix) that quantifies trajectory deviation by comparing linearly interpolated states against actual intermediate outputs from the RF model on a held-out set of samples. We will also include an ablation that measures the effect of this approximation on the preference loss and overall alignment performance. revision: yes
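One way the promised deviation analysis could look, as a sketch only: integrate the learned RF ODE from the recorded prior noise and compare each intermediate state against the straight-line interpolant toward the final image. The Euler integrator and the `model(x, t, prompt_emb)` signature are assumptions.

```python
import torch

@torch.no_grad()
def trajectory_deviation(model, prompt_emb, noise, image, n_steps=50):
    # Per-sample squared deviation between the model's actual trajectory and the
    # linear noise-image interpolant, evaluated at each of n_steps time points.
    x, errs = noise.clone(), []
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i / n_steps, device=x.device)
        x = x + (1.0 / n_steps) * model(x, t, prompt_emb)   # Euler step on the learned velocity
        t_next = (i + 1) / n_steps
        x_lin = (1.0 - t_next) * noise + t_next * image     # straight-line estimate at the same time
        errs.append(((x - x_lin) ** 2).flatten(1).mean(1))
    return torch.stack(errs, dim=1)  # shape (batch, n_steps)
```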
-
Referee: [Experiments] Experiments section: the abstract asserts that PNAPO 'consistently improves preference metrics while substantially reducing training compute,' but the provided text supplies no quantitative numbers, baseline comparisons, or ablation results. Without these, it is impossible to judge effect sizes, the relative contribution of the interpolation versus the dynamic regularizer, or whether the compute savings are robust across model scales.
Authors: We acknowledge that the experimental results need to be presented with greater clarity and detail to allow proper evaluation of effect sizes and component contributions. The full manuscript contains an experiments section (§4) reporting results on state-of-the-art RF text-to-image backbones, but we agree these should be expanded for accessibility. In the revision we will augment §4 with explicit quantitative tables showing preference metric improvements (e.g., win rates), direct baseline comparisons against diffusion-style DPO and other offline methods, ablations isolating the noise-aware interpolation and dynamic regularizer, and compute measurements (training FLOPs and wall-clock time) across multiple model scales. These additions will make the claimed gains and efficiency benefits fully transparent. revision: yes
Circularity Check
No significant circularity; derivation relies on external RF properties and observable quantities
full rationale
The paper augments preference data with retained prior noises (an implementation choice) and derives the surrogate objective by interpolating along the line connecting noise to image, explicitly leveraging the known straight-line property of rectified flows as stated in the abstract. This interpolation is not self-definitional because the linearity assumption originates from prior RF literature rather than being fitted or defined within the present objective. The dynamic regularization term adapts explicitly to the observable reward gap between winner/loser and to training progress, both of which are external to the loss itself and not obtained by fitting the target quantity. No equation reduces a prediction to a fitted parameter by construction, no uniqueness theorem is imported via self-citation, and no ansatz is smuggled through prior work by the same authors. The central claim therefore remains independent of its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Rectified flow models follow nearly straight denoising trajectories from prior noise to image.