arxiv: 2604.15311 · v2 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

Zhanhao Liang , Tao Yang , Jie Wu , Chengjian Feng , Liang Zheng

Authors on Pith no claims yet

Pith reviewed 2026-05-10 11:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords flow matchingmodel alignmentfine-tuningtrajectory shorteningreward gradientsimage generationODE sampling

0 comments

The pith

LeapAlign fine-tunes flow matching models at any generation step by reducing trajectories to two randomized leaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow matching models generate images by following long ODE trajectories, but backpropagating reward signals through those full paths costs too much memory and produces exploding gradients, so early steps that set global image structure stay hard to update. LeapAlign shortens each trajectory to exactly two leaps, where each leap jumps over many sampling steps and directly predicts a later latent. Randomizing the start and end times of the leaps lets the reward gradient reach parameters at any point in the original path. The method further weights trajectories by how well they match the full long path and softens the largest gradient terms rather than dropping them. When applied to the Flux model this produces higher image quality and better text alignment than GRPO or earlier direct-gradient approaches.

Core claim

LeapAlign shortens the long generation trajectory of a flow matching model into two consecutive leaps. Each leap skips multiple ODE steps by predicting future latents in a single forward pass. Randomizing the start and end timesteps of these leaps produces short trajectories that still cover any desired point in the original sequence, allowing direct backpropagation of reward gradients to early generation steps. Trajectories more consistent with the full path receive higher training weight, and gradient terms with large magnitude are down-weighted instead of discarded. This combination yields stable updates that improve both global structure and final image quality when the Flux model is fin

What carries the argument

Two consecutive leaps that each skip multiple ODE steps, with randomized timestep boundaries and consistency-based weighting of the shortened trajectories.

If this is right

Early generation steps that control overall image layout become directly updatable by reward gradients.
Backpropagation memory cost drops from the full trajectory length to the cost of two leaps.
Gradient stability holds without completely removing large-magnitude terms.
Fine-tuned Flux models show better image quality and image-text alignment than GRPO or prior direct methods on standard metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The two-leap pattern may extend to other continuous-time generative models whose sampling follows long ODE paths.
Further tuning the randomization distribution could improve performance for particular reward models or data domains.
The same shortening idea might be tested on video or 3D generation tasks where early steps also set global coherence.

Load-bearing premise

Randomizing the start and end timesteps of the two leaps together with consistency weighting and partial gradient reduction preserves enough accurate signal from the full trajectory to update early steps without large bias or instability.

What would settle it

Run LeapAlign fine-tuning on Flux for a fixed number of steps and measure whether the resulting model improves metrics on prompts that require strong global structure compared with a GRPO baseline; if the gains disappear or memory use stays comparable to full-trajectory backprop, the method does not deliver its claimed advantage.

read the original abstract

This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LeapAlign shortens flow trajectories to two randomized leaps for direct preference gradients, which is a useful engineering move but rests on thin experimental detail.

read the letter

LeapAlign targets the memory and gradient problems that come up when you try to backprop human preference rewards straight through a full flow matching ODE path. The fix is to collapse the long trajectory into two big leaps, each jumping over multiple steps, with randomized start and end times plus consistency weighting to pick trajectories that stay close to the original path. They also downweight large-magnitude gradient terms instead of dropping them outright. This is meant to let early generation steps get updated without the full compute cost, which matters because those steps set global image structure in models like Flux. The core idea is new in how it combines the two-leap construction with the randomization and weighting scheme for flow matching specifically. It does a clear job naming the practical bottleneck and showing a way around it that keeps the method simple enough to implement. The reported gains on Flux for image quality and text alignment over GRPO and prior direct-gradient baselines are the main evidence offered. The soft spots sit in the validation. The abstract gives no ablations on leap length, no error bars, and no check on whether the shortened paths actually preserve the right gradient information for early timesteps. If the leaps systematically miss parts of the full-path sensitivity, the updates could be biased even with the weighting, which would weaken the superiority claim. The paper is aimed at people already working on preference alignment for flow or diffusion models who need scalable direct-gradient methods. Readers who run similar fine-tuning experiments will find the trick worth trying. It deserves a serious referee because the engineering problem is real and the proposed solution is distinct enough to merit checking, though it will need stronger experiments and analysis to hold up.

Referee Report

3 major / 2 minor

Summary. The paper claims that direct gradient backpropagation through long flow-matching trajectories is memory-prohibitive and prevents updates to early steps that control global image structure. LeapAlign addresses this by collapsing the trajectory to two randomized leaps (each skipping multiple ODE steps), applying consistency-based weighting to favor trajectories aligned with the full path, and down-weighting large-magnitude gradient terms. When applied to fine-tune Flux, the method is reported to outperform GRPO-based and prior direct-gradient baselines on image quality and text-alignment metrics.

Significance. If the two-leap approximation and weighting scheme preserve sufficient gradient signal without systematic bias, the approach would offer a practical route to memory-efficient preference alignment of large flow models, directly targeting the early-timestep updates that prior methods could not reach. The randomization-plus-consistency design is a concrete engineering contribution that could generalize beyond Flux.

major comments (3)

[§3.2] §3.2 (Leap construction): the claim that randomizing leap start/end timesteps plus consistency weighting 'preserves gradient information from the full trajectory' lacks any sensitivity analysis or error bound on how the local leap Jacobians deviate from the integrated ODE sensitivity; without this, it is unclear whether early-step updates remain unbiased when leap length increases.
[§4.2] §4.2 (Gradient magnitude reduction): replacing hard clipping with magnitude-based reweighting is presented as an improvement, yet no derivation or ablation shows that this choice avoids the same information loss that complete removal was intended to prevent, nor quantifies its effect on the effective learning rate for early timesteps.
[§5.3] Experiments (Table 2 and §5.3): the reported superiority over GRPO and direct-gradient baselines is stated without error bars, run counts, or controls for hyperparameter search effort; if the consistency weights are tuned post-hoc on the same validation set used for final metrics, the cross-method comparison is not yet load-bearing.

minor comments (2)

[§3.1] Notation for leap start/end timesteps (t_s, t_e) is introduced without an explicit diagram relating them to the original ODE discretization; a small schematic would clarify the randomization procedure.
[Abstract] The abstract lists 'various metrics' without naming them; the experiments section should explicitly state the primary metrics (e.g., FID, CLIP score, human preference) in the first paragraph of §5.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, offering clarifications based on the design choices in LeapAlign and outlining revisions where they strengthen the presentation without altering the core claims.

read point-by-point responses

Referee: [§3.2] §3.2 (Leap construction): the claim that randomizing leap start/end timesteps plus consistency weighting 'preserves gradient information from the full trajectory' lacks any sensitivity analysis or error bound on how the local leap Jacobians deviate from the integrated ODE sensitivity; without this, it is unclear whether early-step updates remain unbiased when leap length increases.

Authors: We acknowledge that the manuscript does not include a formal sensitivity analysis or theoretical error bounds on the deviation of leap Jacobians from the full ODE sensitivity. The randomization of start and end timesteps is intended to average over a distribution of leap lengths, while consistency weighting selects trajectories whose local predictions align closely with the integrated path, thereby mitigating systematic bias in the backpropagated gradients to early steps. Our empirical results across different leap configurations support stable updates, but we agree this could be more rigorously quantified. We will add an empirical sensitivity analysis in the revision, including gradient norm statistics and performance metrics as a function of average leap length. revision: partial
Referee: [§4.2] §4.2 (Gradient magnitude reduction): replacing hard clipping with magnitude-based reweighting is presented as an improvement, yet no derivation or ablation shows that this choice avoids the same information loss that complete removal was intended to prevent, nor quantifies its effect on the effective learning rate for early timesteps.

Authors: The magnitude-based reweighting downscales large gradients proportionally rather than discarding them, which is motivated by retaining directional information that hard clipping removes. No closed-form derivation of the effective learning rate is provided in the current version. We will incorporate an ablation study comparing hard clipping, our reweighting, and alternatives, with explicit measurements of gradient magnitudes and update scales specifically for early timesteps to quantify the retained information. revision: yes
Referee: [§5.3] Experiments (Table 2 and §5.3): the reported superiority over GRPO and direct-gradient baselines is stated without error bars, run counts, or controls for hyperparameter search effort; if the consistency weights are tuned post-hoc on the same validation set used for final metrics, the cross-method comparison is not yet load-bearing.

Authors: We agree that error bars, run counts, and clearer hyperparameter controls are necessary for robust claims. The reported results were obtained from multiple independent training runs with different random seeds; we will update Table 2 to include means and standard deviations along with the exact number of runs. The consistency weights were selected via a separate held-out validation split that was not used for the final test metrics, and we will expand the experimental protocol section to document the full hyperparameter search procedure and confirm the separation of tuning and evaluation sets. revision: yes

Circularity Check

0 steps flagged

No circularity: LeapAlign is an independent engineering method with empirical validation

full rationale

The paper introduces LeapAlign as a practical solution to memory and gradient issues in flow-matching fine-tuning by shortening trajectories to two randomized leaps, applying consistency-based weighting, and magnitude-reduced gradients. No derivation chain, equation, or claim reduces by construction to fitted parameters, self-citations, or prior ansatzes from the authors. The central claims rest on the proposed design choices and reported experimental outperformance on Flux, which are presented as falsifiable engineering results rather than tautological redefinitions. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The abstract supplies limited technical detail, so the ledger is minimal; the method relies on standard flow matching assumptions rather than new invented entities or many fitted parameters.

free parameters (1)

leap consistency weights
Higher weights assigned to trajectories more consistent with the long generation path; specific functional form or fitting procedure not detailed.

axioms (1)

domain assumption Flow matching generation can be approximated by two large leaps while still allowing useful gradient propagation to early timesteps
This premise underpins the claim that the shortened trajectories enable updates at any generation step.

pith-pipeline@v0.9.0 · 5541 in / 1310 out tokens · 97798 ms · 2026-05-10T11:20:26.670833+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

66 extracted references · 38 canonical work pages · 23 internal anchors

[1]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023

work page internal anchor Pith review arXiv 2023
[2]

To- wards self-improvement of diffusion models via group pref- erence optimization.arXiv preprint arXiv:2505.11070,

Renjie Chen, Wenfeng Lin, Yichen Zhang, Jiangchuan Wei, Boyuan Liu, Chao Feng, Jiao Ran, and Mingyu Guo. Towards self-improvement of diffusion models via group preference optimization.arXiv preprint arXiv:2505.11070, 2025

work page arXiv 2025
[3]

Directly fine-tuning diffusion models on differentiable re- wards.arXiv preprint arXiv:2309.17400, 2023

Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023

work page arXiv 2023
[4]

Vard: Efficient and dense fine-tuning for diffusion models with value-based rl.arXiv preprint arXiv:2505.15791, 2025

Fengyuan Dai, Zifeng Zhuang, Yufei Huang, Siteng Huang, Bangyan Liao, Donglin Wang, and Fajie Yuan. Vard: Efficient and dense fine-tuning for diffusion models with value-based rl.arXiv preprint arXiv:2505.15791, 2025

work page arXiv 2025
[5]

Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control

Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky TQ Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control.arXiv preprint arXiv:2409.08861, 2024

work page arXiv 2024
[6]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-firstinternational conference on machine learning, 2024

2024
[7]

Reinforcement learning for fine-tuning text-to-image diffusion models

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Reinforcement learning for fine-tuning text-to-image diffusion models. Advancesin Neural Information Processing Systems, 36, 2024

2024
[8]

One-step diffusion distillation via deep equilibrium models

Zhengyang Geng, Ashwini Pokle, and J Zico Kolter. One-step diffusion distillation via deep equilibrium models. Advancesin Neural Information Processing Systems, 36:41914–41931, 2023

2023
[9]

Mean Flows for One-step Generative Modeling

Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025

work page internal anchor Pith review arXiv 2025
[10]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advancesin Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advancesin Neural Information Processing Systems, 36:52132–52152, 2023

2023
[11]

A simple and effective re- inforcement learning method for text-to-image diffusion fine-tuning.arXiv preprint arXiv:2503.00897, 2025

Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin, Sreya Dutta Roy, Harrie Oosterhuis, Maarten de Rijke, and Satya Narayan Shukla. A simple and effective reinforcement learning method for text-to-image diffusion fine-tuning. arXiv preprint arXiv:2503.00897, 2025

work page arXiv 2025
[12]

Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

work page internal anchor Pith review arXiv 2022
[13]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review arXiv 2022
[14]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

2020
[15]

Margin-aware preference optimization for aligning diffusion models without reference

Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, and Jongheon Jeong. Margin-aware preference optimization for aligning diffusion models without reference. InFirst Workshop on Scalable Optimization for Efficientand Adaptive Foundation Models, 2024

2024
[16]

Scalable ranked preference optimization for text-to-image generation

Shyamgopal Karthik, Huseyin Coskun, Zeynep Akata, Sergey Tulyakov, Jian Ren, and Anil Kag. Scalable ranked preference optimization for text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18399–18410, 2025

2025
[17]

Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advancesin neural information processing systems, 36:36652–36663, 2023

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advancesin neural information processing systems, 36:36652–36663, 2023

2023
[18]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

2024
[19]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...

work page internal anchor Pith review arXiv 2025
[20]

Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation

Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, et al. Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation. In European Conference on Computer Vision, pages 462–478. Springer, 2024

2024
[21]

Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024

Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024

2024
[22]

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde.arXiv preprint arXiv:2507.21802, 2025

work page internal anchor Pith review arXiv 2025
[23]

Aligning diffusion models by optimizing human utility.Advancesin Neural Information Processing Systems, 37:24897–24925, 2024

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility.Advancesin Neural Information Processing Systems, 37:24897–24925, 2024

2024
[24]

arXiv preprint arXiv:2509.06040 (2025) 2, 3

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025

work page arXiv 2025
[25]

Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization

Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13199–13208, 2025

2025
[26]

Step-level reward for free in rl-based t2i diffusion model fine-tuning

Xinyao Liao, Wei Wei, Xiaoye Qu, and Yu Cheng. Step-level reward for free in rl-based t2i diffusion model fine-tuning. arXiv preprint arXiv:2505.19196, 2025

work page arXiv 2025
[27]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

Flow Matching Guide and Code

Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code.arXiv preprint arXiv:2412.06264, 2024

work page internal anchor Pith review arXiv 2024
[29]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review arXiv 2025
[30]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review arXiv 2022
[31]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

Hpsv3: Towards wide-spectrum human preference score

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

2025
[33]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advancesin neural information processing systems, 35:27730–27744, 2022

2022
[34]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review arXiv 2023
[35]

Aligning text-to-image diffusion models with reward backpropagation, 2023

Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation, 2023

2023
[36]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[37]

Direct preference optimization: Your language model is secretly a reward model.Advancesin neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advancesin neural information processing systems, 36:53728–53741, 2023

2023
[38]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

2022
[39]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. 13

work page internal anchor Pith review arXiv 2022
[40]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[41]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

work page internal anchor Pith review arXiv 2025
[42]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Directly aligning the full diffusion trajectory with fine-grained human preference.arXiv preprint arXiv:2509.06942, 2025

Xiangwei Shen, Zhimin Li, Zhantao Yang, Shiyi Zhang, Yingfang Zhang, Donghao Li, Chunyu Wang, Qinglin Lu, and Yansong Tang. Directly aligning the full diffusion trajectory with fine-grained human preference.arXiv preprint arXiv:2509.06942, 2025

work page arXiv 2025
[44]

Imagerefl: Balancing quality and diversity in human-aligned diffusion models.arXiv preprint arXiv:2505.22569, 2025

Dmitrii Sorokin, Maksim Nakhodnov, Andrey Kuznetsov, and Aibek Alanov. Imagerefl: Balancing quality and diversity in human-aligned diffusion models.arXiv preprint arXiv:2505.22569, 2025

work page arXiv 2025
[45]

BalancedDPO: Adaptive Multi-Metric Alignment

Dipesh Tamboli, Souradip Chakraborty, Aditya Malusare, Biplab Banerjee, Amrit Singh Bedi, and Vaneet Aggarwal. Balanceddpo: Adaptive multi-metric alignment.arXiv preprint arXiv:2503.12575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

2024
[47]

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning.arXiv preprint arXiv:2508.20751, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Unified Reward Model for Multimodal Understanding and Generation

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025

work page internal anchor Pith review arXiv 2025
[49]

Simple statistical gradient-following algorithms for connectionist reinforcement learning

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992

1992
[50]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review arXiv 2025
[51]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

work page internal anchor Pith review arXiv 2023
[52]

Deep reward supervisions for tuning text-to-image diffusion models

Xiaoshi Wu, Yiming Hao, Manyuan Zhang, Keqiang Sun, Zhaoyang Huang, Guanglu Song, Yu Liu, and Hongsheng Li. Deep reward supervisions for tuning text-to-image diffusion models. InEuropean Conference on Computer Vision, pages 108–124. Springer, 2024

2024
[53]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

2023
[54]

arXiv preprint arXiv:2509.25050 , year=

Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025

work page arXiv 2025
[55]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818, 2025

work page internal anchor Pith review arXiv 2025
[56]

Using human feedback to fine-tune diffusion models without any reward model

Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024

2024
[57]

Using human feedback to fine-tune diffusion models without any reward model

Shentao Yang, Tianqi Chen, and Mingyuan Zhou. A dense reward view on aligning text-to-image diffusion with preference. arXiv preprint arXiv:2402.08265, 2024

work page arXiv 2024
[58]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024. 14

2024
[59]

Self-play fine-tuning of diffusion models for text-to-image generation

Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning of diffusion models for text-to-image generation. Advancesin Neural Information Processing Systems, 37:73366–73398, 2024

2024
[60]

SePPO: Semi-policy preference optimization for diffusion alignment,

Daoan Zhang, Guangchen Lan, Dong-Jun Han, Wenlin Yao, Xiaoman Pan, Hongming Zhang, Mingxiao Li, Pengcheng Chen, Yu Dong, Christopher Brinton, et al. Seppo: Semi-policy preference optimization for diffusion alignment. arXiv preprint arXiv:2410.05255, 2024

work page arXiv 2024
[61]

Learning multi-dimensional human prefer- ence for text-to-image generation

Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, and Chunhong Pan. Diffusion model as a noise-aware latent reward model for step-level preference optimization.arXiv preprint arXiv:2502.01051, 2025

work page arXiv 2025
[62]

Itercomp: Iterative composition-aware feedback learning from model gallery for text-to-image generation.arXiv preprint arXiv:2410.07171, 2024

Xinchen Zhang, Ling Yang, Guohao Li, Yaqi Cai, Jiake Xie, Yong Tang, Yujiu Yang, Mengdi Wang, and Bin Cui. Itercomp: Iterative composition-aware feedback learning from model gallery for text-to-image generation.arXiv preprint arXiv:2410.07171, 2024

work page arXiv 2024
[63]

Large-scale reinforcement learning for diffusion models

Yinan Zhang, Eric Tzeng, Yilun Du, and Dmitry Kislyuk. Large-scale reinforcement learning for diffusion models. In European Conference on Computer Vision, pages 1–17. Springer, 2024

2024
[64]

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025

work page internal anchor Pith review arXiv 2025
[65]

Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation

Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. InForty-firstInternational Conference on Machine Learning, 2024

2024
[66]

arXiv preprint arXiv:2510.01982 (2025) 3

Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, and Guangtao Zhai. Fine-grained grpo for precise preference alignment in flow models.arXiv preprint arXiv:2510.01982, 2025. 15 Appendix A Visualization of GenEval Score Improvement During Fine-Tuning for Direct- Gradient Methods Figure 5 presents the GenEval score improvemen...

work page arXiv 2025