pith. machine review for the scientific record. sign in

arxiv: 2604.15311 · v2 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

Authors on Pith no claims yet

Pith reviewed 2026-05-10 11:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords flow matchingmodel alignmentfine-tuningtrajectory shorteningreward gradientsimage generationODE sampling
0
0 comments X

The pith

LeapAlign fine-tunes flow matching models at any generation step by reducing trajectories to two randomized leaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow matching models generate images by following long ODE trajectories, but backpropagating reward signals through those full paths costs too much memory and produces exploding gradients, so early steps that set global image structure stay hard to update. LeapAlign shortens each trajectory to exactly two leaps, where each leap jumps over many sampling steps and directly predicts a later latent. Randomizing the start and end times of the leaps lets the reward gradient reach parameters at any point in the original path. The method further weights trajectories by how well they match the full long path and softens the largest gradient terms rather than dropping them. When applied to the Flux model this produces higher image quality and better text alignment than GRPO or earlier direct-gradient approaches.

Core claim

LeapAlign shortens the long generation trajectory of a flow matching model into two consecutive leaps. Each leap skips multiple ODE steps by predicting future latents in a single forward pass. Randomizing the start and end timesteps of these leaps produces short trajectories that still cover any desired point in the original sequence, allowing direct backpropagation of reward gradients to early generation steps. Trajectories more consistent with the full path receive higher training weight, and gradient terms with large magnitude are down-weighted instead of discarded. This combination yields stable updates that improve both global structure and final image quality when the Flux model is fin

What carries the argument

Two consecutive leaps that each skip multiple ODE steps, with randomized timestep boundaries and consistency-based weighting of the shortened trajectories.

If this is right

  • Early generation steps that control overall image layout become directly updatable by reward gradients.
  • Backpropagation memory cost drops from the full trajectory length to the cost of two leaps.
  • Gradient stability holds without completely removing large-magnitude terms.
  • Fine-tuned Flux models show better image quality and image-text alignment than GRPO or prior direct methods on standard metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two-leap pattern may extend to other continuous-time generative models whose sampling follows long ODE paths.
  • Further tuning the randomization distribution could improve performance for particular reward models or data domains.
  • The same shortening idea might be tested on video or 3D generation tasks where early steps also set global coherence.

Load-bearing premise

Randomizing the start and end timesteps of the two leaps together with consistency weighting and partial gradient reduction preserves enough accurate signal from the full trajectory to update early steps without large bias or instability.

What would settle it

Run LeapAlign fine-tuning on Flux for a fixed number of steps and measure whether the resulting model improves metrics on prompts that require strong global structure compared with a GRPO baseline; if the gains disappear or memory use stays comparable to full-trajectory backprop, the method does not deliver its claimed advantage.

read the original abstract

This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that direct gradient backpropagation through long flow-matching trajectories is memory-prohibitive and prevents updates to early steps that control global image structure. LeapAlign addresses this by collapsing the trajectory to two randomized leaps (each skipping multiple ODE steps), applying consistency-based weighting to favor trajectories aligned with the full path, and down-weighting large-magnitude gradient terms. When applied to fine-tune Flux, the method is reported to outperform GRPO-based and prior direct-gradient baselines on image quality and text-alignment metrics.

Significance. If the two-leap approximation and weighting scheme preserve sufficient gradient signal without systematic bias, the approach would offer a practical route to memory-efficient preference alignment of large flow models, directly targeting the early-timestep updates that prior methods could not reach. The randomization-plus-consistency design is a concrete engineering contribution that could generalize beyond Flux.

major comments (3)
  1. [§3.2] §3.2 (Leap construction): the claim that randomizing leap start/end timesteps plus consistency weighting 'preserves gradient information from the full trajectory' lacks any sensitivity analysis or error bound on how the local leap Jacobians deviate from the integrated ODE sensitivity; without this, it is unclear whether early-step updates remain unbiased when leap length increases.
  2. [§4.2] §4.2 (Gradient magnitude reduction): replacing hard clipping with magnitude-based reweighting is presented as an improvement, yet no derivation or ablation shows that this choice avoids the same information loss that complete removal was intended to prevent, nor quantifies its effect on the effective learning rate for early timesteps.
  3. [§5.3] Experiments (Table 2 and §5.3): the reported superiority over GRPO and direct-gradient baselines is stated without error bars, run counts, or controls for hyperparameter search effort; if the consistency weights are tuned post-hoc on the same validation set used for final metrics, the cross-method comparison is not yet load-bearing.
minor comments (2)
  1. [§3.1] Notation for leap start/end timesteps (t_s, t_e) is introduced without an explicit diagram relating them to the original ODE discretization; a small schematic would clarify the randomization procedure.
  2. [Abstract] The abstract lists 'various metrics' without naming them; the experiments section should explicitly state the primary metrics (e.g., FID, CLIP score, human preference) in the first paragraph of §5.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, offering clarifications based on the design choices in LeapAlign and outlining revisions where they strengthen the presentation without altering the core claims.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Leap construction): the claim that randomizing leap start/end timesteps plus consistency weighting 'preserves gradient information from the full trajectory' lacks any sensitivity analysis or error bound on how the local leap Jacobians deviate from the integrated ODE sensitivity; without this, it is unclear whether early-step updates remain unbiased when leap length increases.

    Authors: We acknowledge that the manuscript does not include a formal sensitivity analysis or theoretical error bounds on the deviation of leap Jacobians from the full ODE sensitivity. The randomization of start and end timesteps is intended to average over a distribution of leap lengths, while consistency weighting selects trajectories whose local predictions align closely with the integrated path, thereby mitigating systematic bias in the backpropagated gradients to early steps. Our empirical results across different leap configurations support stable updates, but we agree this could be more rigorously quantified. We will add an empirical sensitivity analysis in the revision, including gradient norm statistics and performance metrics as a function of average leap length. revision: partial

  2. Referee: [§4.2] §4.2 (Gradient magnitude reduction): replacing hard clipping with magnitude-based reweighting is presented as an improvement, yet no derivation or ablation shows that this choice avoids the same information loss that complete removal was intended to prevent, nor quantifies its effect on the effective learning rate for early timesteps.

    Authors: The magnitude-based reweighting downscales large gradients proportionally rather than discarding them, which is motivated by retaining directional information that hard clipping removes. No closed-form derivation of the effective learning rate is provided in the current version. We will incorporate an ablation study comparing hard clipping, our reweighting, and alternatives, with explicit measurements of gradient magnitudes and update scales specifically for early timesteps to quantify the retained information. revision: yes

  3. Referee: [§5.3] Experiments (Table 2 and §5.3): the reported superiority over GRPO and direct-gradient baselines is stated without error bars, run counts, or controls for hyperparameter search effort; if the consistency weights are tuned post-hoc on the same validation set used for final metrics, the cross-method comparison is not yet load-bearing.

    Authors: We agree that error bars, run counts, and clearer hyperparameter controls are necessary for robust claims. The reported results were obtained from multiple independent training runs with different random seeds; we will update Table 2 to include means and standard deviations along with the exact number of runs. The consistency weights were selected via a separate held-out validation split that was not used for the final test metrics, and we will expand the experimental protocol section to document the full hyperparameter search procedure and confirm the separation of tuning and evaluation sets. revision: yes

Circularity Check

0 steps flagged

No circularity: LeapAlign is an independent engineering method with empirical validation

full rationale

The paper introduces LeapAlign as a practical solution to memory and gradient issues in flow-matching fine-tuning by shortening trajectories to two randomized leaps, applying consistency-based weighting, and magnitude-reduced gradients. No derivation chain, equation, or claim reduces by construction to fitted parameters, self-citations, or prior ansatzes from the authors. The central claims rest on the proposed design choices and reported experimental outperformance on Flux, which are presented as falsifiable engineering results rather than tautological redefinitions. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The abstract supplies limited technical detail, so the ledger is minimal; the method relies on standard flow matching assumptions rather than new invented entities or many fitted parameters.

free parameters (1)
  • leap consistency weights
    Higher weights assigned to trajectories more consistent with the long generation path; specific functional form or fitting procedure not detailed.
axioms (1)
  • domain assumption Flow matching generation can be approximated by two large leaps while still allowing useful gradient propagation to early timesteps
    This premise underpins the claim that the shortened trajectories enable updates at any generation step.

pith-pipeline@v0.9.0 · 5541 in / 1310 out tokens · 97798 ms · 2026-05-10T11:20:26.670833+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

66 extracted references · 38 canonical work pages · 23 internal anchors

  1. [1]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023

  2. [2]

    To- wards self-improvement of diffusion models via group pref- erence optimization.arXiv preprint arXiv:2505.11070,

    Renjie Chen, Wenfeng Lin, Yichen Zhang, Jiangchuan Wei, Boyuan Liu, Chao Feng, Jiao Ran, and Mingyu Guo. Towards self-improvement of diffusion models via group preference optimization.arXiv preprint arXiv:2505.11070, 2025

  3. [3]

    Directly fine-tuning diffusion models on differentiable re- wards.arXiv preprint arXiv:2309.17400, 2023

    Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023

  4. [4]

    Vard: Efficient and dense fine-tuning for diffusion models with value-based rl.arXiv preprint arXiv:2505.15791, 2025

    Fengyuan Dai, Zifeng Zhuang, Yufei Huang, Siteng Huang, Bangyan Liao, Donglin Wang, and Fajie Yuan. Vard: Efficient and dense fine-tuning for diffusion models with value-based rl.arXiv preprint arXiv:2505.15791, 2025

  5. [5]

    Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control

    Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky TQ Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control.arXiv preprint arXiv:2409.08861, 2024

  6. [6]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-firstinternational conference on machine learning, 2024

  7. [7]

    Reinforcement learning for fine-tuning text-to-image diffusion models

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Reinforcement learning for fine-tuning text-to-image diffusion models. Advancesin Neural Information Processing Systems, 36, 2024

  8. [8]

    One-step diffusion distillation via deep equilibrium models

    Zhengyang Geng, Ashwini Pokle, and J Zico Kolter. One-step diffusion distillation via deep equilibrium models. Advancesin Neural Information Processing Systems, 36:41914–41931, 2023

  9. [9]

    Mean Flows for One-step Generative Modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025

  10. [10]

    Geneval: An object-focused framework for evaluating text-to-image alignment.Advancesin Neural Information Processing Systems, 36:52132–52152, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advancesin Neural Information Processing Systems, 36:52132–52152, 2023

  11. [11]

    A simple and effective re- inforcement learning method for text-to-image diffusion fine-tuning.arXiv preprint arXiv:2503.00897, 2025

    Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin, Sreya Dutta Roy, Harrie Oosterhuis, Maarten de Rijke, and Satya Narayan Shukla. A simple and effective reinforcement learning method for text-to-image diffusion fine-tuning. arXiv preprint arXiv:2503.00897, 2025

  12. [12]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

  13. [13]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  14. [14]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  15. [15]

    Margin-aware preference optimization for aligning diffusion models without reference

    Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, and Jongheon Jeong. Margin-aware preference optimization for aligning diffusion models without reference. InFirst Workshop on Scalable Optimization for Efficientand Adaptive Foundation Models, 2024

  16. [16]

    Scalable ranked preference optimization for text-to-image generation

    Shyamgopal Karthik, Huseyin Coskun, Zeynep Akata, Sergey Tulyakov, Jian Ren, and Anil Kag. Scalable ranked preference optimization for text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18399–18410, 2025

  17. [17]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advancesin neural information processing systems, 36:36652–36663, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advancesin neural information processing systems, 36:36652–36663, 2023

  18. [18]

    Flux.https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

  19. [19]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...

  20. [20]

    Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation

    Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, et al. Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation. In European Conference on Computer Vision, pages 462–478. Springer, 2024

  21. [21]

    Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation, 2024

  22. [22]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde.arXiv preprint arXiv:2507.21802, 2025

  23. [23]

    Aligning diffusion models by optimizing human utility.Advancesin Neural Information Processing Systems, 37:24897–24925, 2024

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility.Advancesin Neural Information Processing Systems, 37:24897–24925, 2024

  24. [24]

    arXiv preprint arXiv:2509.06040 (2025) 2, 3

    Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025

  25. [25]

    Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization

    Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13199–13208, 2025

  26. [26]

    Step-level reward for free in rl-based t2i diffusion model fine-tuning

    Xinyao Liao, Wei Wei, Xiaoye Qu, and Yu Cheng. Step-level reward for free in rl-based t2i diffusion model fine-tuning. arXiv preprint arXiv:2505.19196, 2025

  27. [27]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  28. [28]

    Flow Matching Guide and Code

    Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code.arXiv preprint arXiv:2412.06264, 2024

  29. [29]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

  30. [30]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  31. [31]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  32. [32]

    Hpsv3: Towards wide-spectrum human preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

  33. [33]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advancesin neural information processing systems, 35:27730–27744, 2022

  34. [34]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

  35. [35]

    Aligning text-to-image diffusion models with reward backpropagation, 2023

    Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation, 2023

  36. [36]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021

  37. [37]

    Direct preference optimization: Your language model is secretly a reward model.Advancesin neural information processing systems, 36:53728–53741, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advancesin neural information processing systems, 36:53728–53741, 2023

  38. [38]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  39. [39]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. 13

  40. [40]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  41. [41]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

  42. [42]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  43. [43]

    Directly aligning the full diffusion trajectory with fine-grained human preference.arXiv preprint arXiv:2509.06942, 2025

    Xiangwei Shen, Zhimin Li, Zhantao Yang, Shiyi Zhang, Yingfang Zhang, Donghao Li, Chunyu Wang, Qinglin Lu, and Yansong Tang. Directly aligning the full diffusion trajectory with fine-grained human preference.arXiv preprint arXiv:2509.06942, 2025

  44. [44]

    Imagerefl: Balancing quality and diversity in human-aligned diffusion models.arXiv preprint arXiv:2505.22569, 2025

    Dmitrii Sorokin, Maksim Nakhodnov, Andrey Kuznetsov, and Aibek Alanov. Imagerefl: Balancing quality and diversity in human-aligned diffusion models.arXiv preprint arXiv:2505.22569, 2025

  45. [45]

    BalancedDPO: Adaptive Multi-Metric Alignment

    Dipesh Tamboli, Souradip Chakraborty, Aditya Malusare, Biplab Banerjee, Amrit Singh Bedi, and Vaneet Aggarwal. Balanceddpo: Adaptive multi-metric alignment.arXiv preprint arXiv:2503.12575, 2025

  46. [46]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  47. [47]

    Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

    Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning.arXiv preprint arXiv:2508.20751, 2025

  48. [48]

    Unified Reward Model for Multimodal Understanding and Generation

    Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025

  49. [49]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992

  50. [50]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  51. [51]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

  52. [52]

    Deep reward supervisions for tuning text-to-image diffusion models

    Xiaoshi Wu, Yiming Hao, Manyuan Zhang, Keqiang Sun, Zhaoyang Huang, Guanglu Song, Yu Liu, and Hongsheng Li. Deep reward supervisions for tuning text-to-image diffusion models. InEuropean Conference on Computer Vision, pages 108–124. Springer, 2024

  53. [53]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

  54. [54]

    arXiv preprint arXiv:2509.25050 , year=

    Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025

  55. [55]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818, 2025

  56. [56]

    Using human feedback to fine-tune diffusion models without any reward model

    Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024

  57. [57]

    Using human feedback to fine-tune diffusion models without any reward model

    Shentao Yang, Tianqi Chen, and Mingyuan Zhou. A dense reward view on aligning text-to-image diffusion with preference. arXiv preprint arXiv:2402.08265, 2024

  58. [58]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024. 14

  59. [59]

    Self-play fine-tuning of diffusion models for text-to-image generation

    Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning of diffusion models for text-to-image generation. Advancesin Neural Information Processing Systems, 37:73366–73398, 2024

  60. [60]

    SePPO: Semi-policy preference optimization for diffusion alignment,

    Daoan Zhang, Guangchen Lan, Dong-Jun Han, Wenlin Yao, Xiaoman Pan, Hongming Zhang, Mingxiao Li, Pengcheng Chen, Yu Dong, Christopher Brinton, et al. Seppo: Semi-policy preference optimization for diffusion alignment. arXiv preprint arXiv:2410.05255, 2024

  61. [61]

    Learning multi-dimensional human prefer- ence for text-to-image generation

    Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, and Chunhong Pan. Diffusion model as a noise-aware latent reward model for step-level preference optimization.arXiv preprint arXiv:2502.01051, 2025

  62. [62]

    Itercomp: Iterative composition-aware feedback learning from model gallery for text-to-image generation.arXiv preprint arXiv:2410.07171, 2024

    Xinchen Zhang, Ling Yang, Guohao Li, Yaqi Cai, Jiake Xie, Yong Tang, Yujiu Yang, Mengdi Wang, and Bin Cui. Itercomp: Iterative composition-aware feedback learning from model gallery for text-to-image generation.arXiv preprint arXiv:2410.07171, 2024

  63. [63]

    Large-scale reinforcement learning for diffusion models

    Yinan Zhang, Eric Tzeng, Yilun Du, and Dmitry Kislyuk. Large-scale reinforcement learning for diffusion models. In European Conference on Computer Vision, pages 1–17. Springer, 2024

  64. [64]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025

  65. [65]

    Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation

    Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. InForty-firstInternational Conference on Machine Learning, 2024

  66. [66]

    arXiv preprint arXiv:2510.01982 (2025) 3

    Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, and Guangtao Zhai. Fine-grained grpo for precise preference alignment in flow models.arXiv preprint arXiv:2510.01982, 2025. 15 Appendix A Visualization of GenEval Score Improvement During Fine-Tuning for Direct- Gradient Methods Figure 5 presents the GenEval score improvemen...