Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.
TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
Mixed citation behavior. Most common role is background (67%).
abstract
Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce TempFlow-GRPO (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces three key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases; and (iii) a seed group strategy that controls for initialization effects to isolate exploration contributions. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and text-to-image benchmarks.
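The weighting idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract gives no formulas, so the weight rule (proportional to an assumed per-timestep noise level, rescaled to mean 1), the function names, and the toy numbers are all hypothetical. Only the group-relative reward normalization is the standard GRPO step.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def noise_aware_weights(noise_levels):
    """Hypothetical rule: weight each timestep by its noise level, rescaled to mean 1."""
    total = sum(noise_levels)
    n = len(noise_levels)
    return [s * n / total for s in noise_levels]


# One group of samples for the same prompt. Treating them as sharing an
# initial noise seed is an assumption about what "seed group" means.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# Assumed noise levels per denoising step: large early, small late.
noise_levels = [1.0, 0.75, 0.5, 0.25]
weights = noise_aware_weights(noise_levels)

# Per-sample, per-timestep weighted advantages: early steps get larger updates.
weighted = [[w * a for w in weights] for a in advantages]
```

With these toy numbers the early timesteps receive the largest weights, which mirrors the abstract's stated preference for learning during high-impact early stages; the actual schedule used by TempFlow-GRPO may differ.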
representative citing papers
DRM turns a pre-trained diffusion model into a step-wise reward model and uses it for dense RL training (Step-wise GRPO) and guided sampling to improve final image quality.
OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
Pave-GRPO reformulates GRPO via principled average velocity decomposition to enable denser temporal supervision in flow-based generative model alignment without increasing rollout cost.
AdvantageFlow proposes an advantage-weighted forward-process least-squares loss for RL in rectified flow models, stabilized by rollout policy regularization, and reports better image generation performance than Flow-GRPO on Stable Diffusion 3.5.
Flash-GRPO is a one-step GRPO framework for video diffusion alignment that applies iso-temporal grouping and temporal gradient rectification to achieve higher alignment quality and stability than full-trajectory training under low compute budgets on 1.3B-14B models.
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
An ELBO-based likelihood estimator from the final generated sample dominates other RL design factors for diffusion models, raising GenEval from 0.24 to 0.95 in 90 GPU hours with better efficiency than prior methods.
DP-DMD preserves sample diversity in few-step image synthesis by applying a teacher-derived target-prediction objective to the first distillation step and standard DMD loss to the rest.
ViPO enhances GRPO for visual generation by creating spatially and temporally aware advantage maps from pretrained vision models to focus optimization on perceptually important regions.
Dynamic-TreeRPO replaces independent trajectory sampling with a tree-structured search using dynamic noise intensities and integrates SFT into RL via a weighted Progress Reward Model to achieve better semantic consistency and efficiency in text-to-image generation.
AdaGRPO enhances GRPO for flow models via online curriculum filtering of prompts and cross-level advantage fusion, yielding performance gains and training stability.
E²PO uses embedding-level perturbations to maintain intra-group variance and discriminative signal in RL-based preference optimization for generative flow models.
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.
GCPO shifts RL policy optimization for flow matching from step-level to chunk-level grouping of consecutive denoising steps, reporting up to 43% relative gains over GRPO on T2I benchmarks and preference tasks.
Pref-GRPO stabilizes T2I RL training by using pairwise win rates from preference models as rewards instead of normalized pointwise scores, while UniGenBench enables finer-grained model evaluation across themes and criteria.
citing papers explorer
- Reward-Aware Trajectory Shaping for Few-step Visual Generation
RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.
- UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models