Flow-GRPO: Training Flow Matching Models via Online RL

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
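As a rough illustration of the GRPO-style core, here is a minimal sketch of group-relative advantages and the clipped surrogate over per-step log-probabilities. It assumes rollouts are drawn stochastically (Flow-GRPO converts the sampling ODE to an equivalent SDE for exactly this reason); the function names and the clip value are illustrative, not taken from the paper's code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: (num_prompts, group_size); each image is scored against
    # the other images generated for the same prompt, so no value
    # network or global baseline is needed.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_surrogate(logp_new, logp_old, adv, clip: float = 0.2):
    # PPO-style clipped objective on per-denoising-step log-probs of the
    # SDE sampler; clip=0.2 is a generic default, not the paper's value.
    ratio = (logp_new - logp_old).exp()
    return -torch.min(ratio * adv, ratio.clamp(1 - clip, 1 + clip) * adv).mean()
```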
16 Pith papers cite this work.

Representative citing papers
-
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
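For context, the straight-line interpolation that rectified flow (and hence PNAPO's surrogate) builds on is just a linear path between noise and data. A minimal sketch, with PNAPO's noise-tracking and preference terms elided:

```python
import torch

def rectified_flow_point(x_data: torch.Tensor, noise: torch.Tensor, t: torch.Tensor):
    # Straight-line interpolation x_t = t * x_data + (1 - t) * noise;
    # the target velocity x_data - noise is constant along the path,
    # which is what makes the linear path a natural surrogate here.
    t = t.view(-1, *([1] * (x_data.dim() - 1)))  # broadcast t over data dims
    x_t = t * x_data + (1.0 - t) * noise
    v_target = x_data - noise
    return x_t, v_target
```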
-
Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models
Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on the controllability-quality trade-off in three controlled NLP generation tasks with discrete diffusion models.
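The underlying knob is the classifier-free guidance scale. A minimal sketch of turning it into a per-step quantity w_t (supplied here as a value; the paper learns it with PPO) instead of one fixed hyperparameter:

```python
import torch

def cfg_step(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, w_t: float) -> torch.Tensor:
    # Standard classifier-free guidance combination; letting w_t vary
    # with the step turns a fixed scale into a guidance *trajectory*.
    return eps_uncond + w_t * (eps_cond - eps_uncond)
```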
-
Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline
A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.
-
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
FMRG is a training-free, single-trajectory guidance method for flow models, derived from optimal control, that achieves strong reward alignment with only 3 function evaluations (NFEs).
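FMRG's exact update is in the paper; as a hedged illustration of the general pattern it belongs to (training-free reward guidance through a one-shot flow-map estimate of the clean sample), with `flow_map`, `reward`, and `eta` as assumed stand-ins:

```python
import torch

def reward_guided_step(x_t, t, flow_map, reward, eta: float = 0.1):
    # Generic training-free reward guidance, not FMRG's exact rule:
    # estimate x_0 in one shot with the flow map, then nudge the
    # current state along the reward gradient at that estimate.
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = flow_map(x_t, t)              # single NFE per guided step
    grad = torch.autograd.grad(reward(x0_hat).sum(), x_t)[0]
    return (x_t + eta * grad).detach()
```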
-
Long-Text-to-Image Generation via Compositional Prompt Decomposition
PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models while generalizing better to prompts over 500 tokens.
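The merging step appears to follow the standard energy-based conjunction from composable diffusion. A minimal sketch (PRISM's decomposition and exact weighting are elided; the formula below is the generic one, not necessarily the paper's):

```python
import torch

def conjunction(eps_uncond: torch.Tensor, eps_conds, weights) -> torch.Tensor:
    # Energy-based AND of sub-prompt predictions: start from the
    # unconditional prediction and add each sub-prompt's guidance direction.
    out = eps_uncond.clone()
    for eps_c, w in zip(eps_conds, weights):
        out = out + w * (eps_c - eps_uncond)
    return out
```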
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
Step-level Denoising-time Diffusion Alignment with Multiple Objectives
MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.
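For intuition, KL-regularized reward alignment generically admits a tilted closed form. Under that standard assumption (the paper's exact multi-objective expression may differ), the optimal reverse kernel looks like:

```latex
p^{*}(x_{t-1} \mid x_t) \;\propto\;
p_{\mathrm{ref}}(x_{t-1} \mid x_t)\,
\exp\!\Big(\tfrac{1}{\beta}\,\textstyle\sum_{i} w_i\, r_i(x_{t-1})\Big)
```

That is, the reference denoiser reweighted by the exponentiated weighted rewards, which is why step-level RL fine-tuning toward the same objective can match it without approximation.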
-
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models: advantage-dependent Fisher-Rao weighting reshapes the local policy space, reducing reward hacking and improving on GRPO baselines.
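As a loose illustration of the "super-linear" idea only (the actual Fisher-Rao geometric update is more involved, and the exponent below is an assumption, not the paper's):

```python
import torch

def superlinear_shape(adv: torch.Tensor, p: float = 2.0) -> torch.Tensor:
    # Magnify large advantages relative to small ones while preserving
    # sign; renormalize so the overall update scale stays comparable.
    shaped = adv.sign() * adv.abs().pow(p)
    return shaped / (shaped.abs().mean() + 1e-8)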
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
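Two classic drift-diffusion identities make the idea concrete: for drift v, noise sigma, and symmetric bounds at +a and -a, choice probabilities are logistic in the drift, and mean decision times carry extra information about its magnitude. A sketch of the textbook formulas, not the paper's estimator:

```python
import math

def p_upper(v: float, a: float, sigma: float = 1.0) -> float:
    # P(hit +a before -a | start at 0) for a drift-diffusion process.
    return 1.0 / (1.0 + math.exp(-2.0 * a * v / sigma**2))

def mean_decision_time(v: float, a: float, sigma: float = 1.0) -> float:
    # Mean first-passage time for the symmetric DDM (v != 0): fast
    # responses signal strong preferences, slow ones near-indifference.
    return (a / v) * math.tanh(a * v / sigma**2)
```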
-
Bias at the End of the Score
Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.
-
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence while maintaining or improving alignment performance on models like FLUX.1 and SD3.5.
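A minimal sketch of the decoupling pattern; true FP4 inference needs dedicated kernels and hardware, so bfloat16 stands in for the low-precision explorer here, and `generate`/`log_prob` are assumed interfaces, not the paper's API:

```python
import copy
import torch

def decoupled_rollout(policy, prompts, reward_fn):
    # Exploration on a cheap low-precision copy; no gradients here.
    with torch.no_grad():
        explorer = copy.deepcopy(policy).to(torch.bfloat16)  # stand-in for FP4
        samples = explorer.generate(prompts)
        rewards = reward_fn(samples, prompts)
    # Optimization re-scores the rollouts with the BF16 training policy,
    # so the policy gradient is taken at full training precision.
    logp = policy.log_prob(samples, prompts)
    return samples, rewards, logp
```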
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and MCTS methods.
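Systematic resampling itself is standard; a minimal sketch (VASR's variance split and mass-allocation rule are elided):

```python
import numpy as np

def systematic_resample(weights: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # One uniform draw, then n evenly spaced positions through the
    # normalized-weight CDF; lower resampling variance than multinomial.
    n = len(weights)
    positions = (rng.uniform() + np.arange(n)) / n
    cdf = np.cumsum(weights / weights.sum())
    return np.searchsorted(cdf, positions)  # ancestor index per particle
```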
-
Improving Video Generation with Human Feedback
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
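Flow-DPO appears to follow the Diffusion-DPO template with flow-matching (velocity) errors in place of diffusion losses. A hedged sketch, where each err_* is a per-sample velocity MSE under the policy or the frozen reference, and beta is illustrative rather than the paper's value:

```python
import torch
import torch.nn.functional as F

def flow_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, beta: float = 500.0):
    # Reward the policy for improving (relative to the reference) more
    # on the winning sample than on the losing one; margin < 0 is good.
    margin = (err_w - err_w_ref) - (err_l - err_l_ref)
    return -F.logsigmoid(-beta * margin).mean()
```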
-
Towards General Preference Alignment: Diffusion Models at Nash Equilibrium
Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.