Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Mixed citations
Step-aware preference optimization: Aligning preference with denoising performance at each step
Mixed citation behavior. Most common role is background (40%).
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 8representative citing papers
PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
AdaScope adaptively selects optimal RL intervention points during diffusion denoising by monitoring structural and semantic changes, delivering 66% higher performance at 59% lower cost than full-trajectory RL baselines.
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.
JS divergence in a unified f-divergence framework for GRPO-style T2I alignment yields competitive performance while preserving generation diversity.
BalancedDPO applies majority-vote consensus from multiple preference scorers and dynamic reference model updates within DPO to achieve multi-metric alignment for text-to-image diffusion models, reporting improved win rates on Pick-a-Pic, PartiPrompt, and HPD datasets across SD 1.5, 2.1, and SDXL.
citing papers explorer
-
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
-
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
-
Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
AdaScope adaptively selects optimal RL intervention points during diffusion denoising by monitoring structural and semantic changes, delivering 66% higher performance at 59% lower cost than full-trajectory RL baselines.
-
DanceGRPO: Unleashing GRPO on Visual Generation
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.
-
Improving Video Generation with Human Feedback
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
-
Towards General Preference Alignment: Diffusion Models at Nash Equilibrium
Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.
-
Balancing Performance and Diversity in GRPO Autoregressive Text-to-Image Post-Training
JS divergence in a unified f-divergence framework for GRPO-style T2I alignment yields competitive performance while preserving generation diversity.
-
BalancedDPO: Adaptive Multi-Metric Alignment
BalancedDPO applies majority-vote consensus from multiple preference scorers and dynamic reference model updates within DPO to achieve multi-metric alignment for text-to-image diffusion models, reporting improved win rates on Pick-a-Pic, PartiPrompt, and HPD datasets across SD 1.5, 2.1, and SDXL.