
arxiv: 2505.05470 · v5 · submitted 2025-05-08 · 💻 cs.CV · cs.AI

Flow-GRPO: Training Flow Matching Models via Online RL

Pith reviewed 2026-05-11 18:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords flow matching · reinforcement learning · policy gradient · text-to-image generation · ODE to SDE · denoising reduction · generative models

The pith

Flow matching models can be trained with online policy gradient reinforcement learning by converting their ODE to an equivalent SDE with identical marginals at every timestep and by reducing denoising steps during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to apply online reinforcement learning to flow matching models used for image generation. It converts the deterministic ODE trajectory into a stochastic differential equation whose probability distribution matches the original model exactly at each timestep, which supplies valid samples for the RL exploration process. A separate strategy shortens the number of denoising steps used while the model is being trained, yet leaves the full number of steps available at inference time. This combination produces large gains on tasks that require precise control over object counts, spatial arrangements, and embedded text. The resulting models show improved alignment with human preferences while exhibiting little reward hacking that would trade away image quality or diversity.
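One way to see why the stochastic reformulation matters for policy-gradient RL: a deterministic ODE step has no transition density, whereas a single Euler-Maruyama step of an SDE is a Gaussian transition whose log-probability can be evaluated. The sketch below is illustrative only. The function names, the placeholder drift, and the noise level `sigma` are hypothetical and are not taken from the paper.

```python
import numpy as np

def sde_step(x, drift, sigma, dt, rng):
    """One Euler-Maruyama step; dt may be negative when integrating from noise to data."""
    mean = x + drift * dt
    std = sigma * np.sqrt(abs(dt))
    x_next = mean + std * rng.standard_normal(x.shape)
    return x_next, mean, std

def gaussian_logprob(x_next, mean, std):
    """Log-density of x_next under the isotropic Gaussian N(mean, std^2 I)."""
    d = x_next.size
    sq = np.sum((x_next - mean) ** 2)
    return -0.5 * sq / std ** 2 - d * np.log(std) - 0.5 * d * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
drift = -x  # placeholder drift (assumption); a trained model would supply this
x_next, mean, std = sde_step(x, drift, sigma=0.7, dt=-0.1, rng=rng)
logp = gaussian_logprob(x_next, mean, std)  # usable in a likelihood-ratio objective
```

With `sigma = 0` the step collapses to the deterministic ODE update and the log-density is undefined, which is the situation the conversion is meant to avoid.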

Core claim

Flow matching models, which learn a velocity field along deterministic ODE paths, become amenable to online policy-gradient RL once their ODE is converted to an SDE whose marginal distribution equals the flow model's distribution at every timestep; a denoising-reduction schedule then accelerates the RL updates without altering the original inference procedure.
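The denoising-reduction idea can be pictured as two timestep grids over the same interval: a coarse one for collecting RL rollouts and the full one for inference. The step counts below (10 and 40), the uniform spacing, and the convention that t=1 is noise and t=0 is data are assumptions made for illustration; this summary does not state the actual values.

```python
import numpy as np

def timestep_grid(n_steps):
    """Uniform grid from t=1 (noise) to t=0 (data) with n_steps intervals."""
    return np.linspace(1.0, 0.0, n_steps + 1)

TRAIN_STEPS = 10   # hypothetical: fewer steps while collecting RL rollouts
INFER_STEPS = 40   # hypothetical: full step count kept for inference

train_grid = timestep_grid(TRAIN_STEPS)
infer_grid = timestep_grid(INFER_STEPS)

# Model evaluations per sampled image scale with the number of steps,
# so rollout cost drops by roughly this factor during training.
speedup = INFER_STEPS / TRAIN_STEPS
```

Both grids cover the same interval, so the trained model is still sampled with the full grid at inference time.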

What carries the argument

The ODE-to-SDE conversion that produces an SDE whose marginal distribution exactly matches the original flow-matching model at every timestep, thereby supplying statistically valid trajectories for RL exploration.
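The general construction behind a marginal-preserving conversion can be written down compactly. The following is a sketch of the textbook argument, with $v_t$ the learned velocity field, $p_t$ the marginal density, and $\sigma_t$ a noise schedule treated as a free choice. The paper's specific coefficients are not reproduced in this summary, so none are assumed here.

```latex
% Flow ODE and the continuity equation satisfied by its marginals
\[
\frac{dx_t}{dt} = v_t(x_t), \qquad
\partial_t p_t = -\nabla \cdot \left( v_t \, p_t \right).
\]

% Forward-time SDE with the same marginals, for any \sigma_t \ge 0
\[
dx_t = \left[ v_t(x_t) + \tfrac{\sigma_t^2}{2} \nabla \log p_t(x_t) \right] dt
       + \sigma_t \, dw_t .
\]

% Its Fokker-Planck equation reduces to the continuity equation,
% because p_t \nabla \log p_t = \nabla p_t:
\[
\partial_t p_t
 = -\nabla \cdot \left( \left[ v_t + \tfrac{\sigma_t^2}{2} \nabla \log p_t \right] p_t \right)
   + \tfrac{\sigma_t^2}{2} \Delta p_t
 = -\nabla \cdot \left( v_t \, p_t \right).
\]
```

When sampling runs in the direction of decreasing $t$ (noise to data), the same argument applied in reversed time gives the drift $v_t - \tfrac{\sigma_t^2}{2} \nabla \log p_t$. In both directions the score-dependent drift correction is what cancels the effect of the added noise on the marginals.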

If this is right

  • Generative performance improves substantially on tasks requiring accurate object counts, spatial relations, and fine-grained attributes.
  • Accuracy on visual text rendering increases markedly.
  • Alignment with human preferences rises while image quality and diversity remain largely intact.
  • The same procedure applies across multiple text-to-image tasks with minimal reward hacking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ODE-to-SDE device could be applied to other deterministic continuous-time generative models to enable RL fine-tuning.
  • The training-time efficiency gain from fewer denoising steps may make the approach practical for higher-resolution or longer-horizon models.
  • Reward functions could be engineered to target specific remaining failure modes such as counting errors or text legibility without retraining from scratch.

Load-bearing premise

The ODE-to-SDE conversion produces an SDE whose marginal distribution exactly matches the original flow-matching model at every timestep.

What would settle it

Sampling states from the converted SDE at an intermediate timestep and finding that their distribution differs from the distribution of states reached by integrating the original ODE to the same timestep would show that the RL policy is being trained on invalid data.
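The check described above can be sketched numerically on a toy problem where everything is available in closed form. The example assumes a one-dimensional Gaussian data distribution, the linear interpolation x_t = (1 - t) x_0 + t eps, and a constant noise level; none of these choices come from the paper, and the constants `M`, `S`, and `sigma` are hypothetical.

```python
import numpy as np

M, S = 2.0, 0.5  # toy data distribution N(M, S^2) (assumed values)

def marginal(t):
    """Mean and variance of x_t = (1 - t) * x_0 + t * eps, with eps ~ N(0, 1)."""
    return (1.0 - t) * M, (1.0 - t) ** 2 * S ** 2 + t ** 2

def velocity(x, t):
    """Exact velocity E[eps - x_0 | x_t = x] for the Gaussian toy problem."""
    mu, var = marginal(t)
    return -M + (t - (1.0 - t) * S ** 2) / var * (x - mu)

def score(x, t):
    """Exact score of the Gaussian marginal p_t."""
    mu, var = marginal(t)
    return -(x - mu) / var

def sample(sigma, t_end=0.0, n_steps=200, n_samples=20000, seed=0):
    """Integrate from t=1 down to t_end; sigma=0 gives the plain Euler ODE solver."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)          # p_1 is N(0, 1)
    ts = np.linspace(1.0, t_end, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t                         # negative: time runs backwards
        drift = velocity(x, t) - 0.5 * sigma ** 2 * score(x, t)
        noise = sigma * np.sqrt(-dt) * rng.standard_normal(n_samples)
        x = x + drift * dt + noise
    return x

# Compare ODE and SDE samples at an intermediate timestep with the exact marginal.
mu_mid, var_mid = marginal(0.5)
ode_mid = sample(0.0, t_end=0.5)
sde_mid = sample(0.7, t_end=0.5)
print("exact:", mu_mid, var_mid ** 0.5)
print("ode:  ", ode_mid.mean(), ode_mid.std())
print("sde:  ", sde_mid.mean(), sde_mid.std())
```

Dropping the score term from `drift` while keeping `sigma > 0` makes the sample variance drift away from the exact marginal, which is the kind of mismatch the falsification test would detect.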

Original abstract

We propose Flow-GRPO, the first method to integrate online policy gradient reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original number of inference steps, significantly improving sampling efficiency without sacrificing performance. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For compositional generation, RL-tuned SD3.5-M generates nearly perfect object counts, spatial relations, and fine-grained attributes, increasing GenEval accuracy from $63\%$ to $95\%$. In visual text rendering, accuracy improves from $59\%$ to $92\%$, greatly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, very little reward hacking occurred, meaning rewards did not increase at the cost of appreciable image quality or diversity degradation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Flow-GRPO, the first method to integrate online policy-gradient RL into flow-matching models for text-to-image generation. It introduces two components: (1) an ODE-to-SDE conversion claimed to produce an SDE whose marginal distribution exactly matches the original flow model at every timestep, enabling on-manifold exploration for RL, and (2) a Denoising Reduction strategy that lowers the number of denoising steps during training while preserving inference steps. Experiments on SD3.5-M report large gains on compositional generation (GenEval accuracy 63% to 95%) and visual text rendering (59% to 92%), with substantial human-preference alignment and minimal reward hacking.

Significance. If the ODE-to-SDE conversion is shown to preserve marginals exactly, the work would provide a practical route for applying online RL to flow-based generative models, potentially improving controllability and alignment on complex tasks without severe distribution shift or reward hacking. The scale of the reported empirical gains suggests the approach could be impactful for downstream applications in compositional and text-conditioned image synthesis.

major comments (2)
  1. [§3.2] §3.2, ODE-to-SDE conversion: The manuscript asserts that the derived SDE matches the original flow-matching marginal p_t(x) at all timesteps so that SDE trajectories remain valid training data for the RL policy. No explicit SDE coefficients (drift and diffusion) or derivation verifying that the Fokker-Planck equation holds exactly for the flow ODE's velocity field are supplied; without this, the central claim that SDE samples introduce no distribution shift cannot be verified and the reported gains cannot be attributed to the proposed mechanism.
  2. [§4] §4, Experimental results: Large improvements are reported (GenEval 63%→95%, text rendering 59%→92%), yet the manuscript provides neither the precise reward functions, baseline RL implementations, number of policy updates, statistical significance tests, nor ablations that isolate the ODE-to-SDE conversion from the denoising-reduction component. This absence makes it impossible to confirm that the gains stem from the claimed technical contributions rather than implementation details or hyperparameter tuning.
minor comments (2)
  1. [Notation] The notation distinguishing the original flow ODE from the converted SDE would be clearer if both sets of equations were presented side-by-side in the main text rather than deferred to the appendix.
  2. [Figure 2] Figure 2 (or equivalent) illustrating the training pipeline would benefit from explicit labels indicating where marginal preservation is enforced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the technical contributions and improve reproducibility. We address each major comment below and will revise the manuscript accordingly to provide the requested details and derivations.

  1. Referee: [§3.2] §3.2, ODE-to-SDE conversion: The manuscript asserts that the derived SDE matches the original flow-matching marginal p_t(x) at all timesteps so that SDE trajectories remain valid training data for the RL policy. No explicit SDE coefficients (drift and diffusion) or derivation verifying that the Fokker-Planck equation holds exactly for the flow ODE's velocity field are supplied; without this, the central claim that SDE samples introduce no distribution shift cannot be verified and the reported gains cannot be attributed to the proposed mechanism.

    Authors: We agree that the manuscript would benefit from an explicit derivation. The ODE-to-SDE conversion adds a diffusion term together with a compensating drift correction, chosen so that the Fokker-Planck equation of the SDE coincides with the continuity equation of the flow ODE and the marginals p_t(x) remain unchanged. In the revision we will include the closed-form drift and diffusion coefficients together with a step-by-step verification of this equivalence. revision: yes

  2. Referee: [§4] §4, Experimental results: Large improvements are reported (GenEval 63%→95%, text rendering 59%→92%), yet the manuscript provides neither the precise reward functions, baseline RL implementations, number of policy updates, statistical significance tests, nor ablations that isolate the ODE-to-SDE conversion from the denoising-reduction component. This absence makes it impossible to confirm that the gains stem from the claimed technical contributions rather than implementation details or hyperparameter tuning.

    Authors: We acknowledge the need for greater experimental transparency. The revised manuscript will report: the exact reward functions (including the GenEval and text-rendering reward formulations), the baseline RL implementations used for comparison, the total number of policy-gradient updates, standard deviations and statistical significance tests across multiple random seeds, and dedicated ablations that separately disable the ODE-to-SDE conversion and the Denoising Reduction strategy while keeping all other hyperparameters fixed. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard components without self-referential reduction


The paper presents Flow-GRPO as an integration of online policy gradient RL with flow matching via two strategies: ODE-to-SDE conversion (asserted to preserve marginals exactly) and Denoising Reduction. These are described as novel combinations of existing techniques rather than derivations that collapse to author-defined fits or self-citations. No equations in the provided text reduce a claimed prediction or result to an input by construction (e.g., no fitted parameter renamed as output, no uniqueness theorem imported from overlapping prior work). Empirical gains are reported as measured outcomes on benchmarks, not tautological. The central assumption about marginal preservation is a technical claim open to verification but does not constitute circularity under the specified patterns. The derivation chain is self-contained against external RL and flow-matching literature.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities beyond the two named strategies; full paper would be needed to audit any implicit assumptions in the conversion or reward formulation.



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

    cs.CV 2026-05 unverdicted novelty 8.0

    CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.

  2. OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

    cs.CV 2026-04 unverdicted novelty 8.0

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  3. Geo-Align: Video Generation Alignment via Metric Geometry Reward

    cs.CV 2026-05 unverdicted novelty 7.0

    Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

  4. Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.

  5. AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

    cs.AI 2026-05 unverdicted novelty 7.0

    AutoRubric-T2I learns and selects explicit rubrics from preference pairs to guide VLM judges, producing high-quality interpretable rewards for T2I alignment with far less data than traditional Bradley-Terry models.

  6. AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

    cs.AI 2026-05 unverdicted novelty 7.0

    AutoRubric-T2I learns a small set of interpretable rubrics for VLM judges that outperform scalar reward models on T2I benchmarks while using far less preference data.

  7. ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

    cs.CV 2026-05 unverdicted novelty 7.0

    ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.

  8. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV 2026-05 unverdicted novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

  9. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  10. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

  11. OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.

  12. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

  13. Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

    cs.CV 2026-05 unverdicted novelty 7.0

    PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

  14. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 conditional novelty 7.0

    Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...

  15. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  16. Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline

    cs.AI 2026-05 unverdicted novelty 7.0

    A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.

  17. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...

  18. Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

  19. HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

  20. Generative Texture Filtering

    cs.CV 2026-04 unverdicted novelty 7.0

    A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.

  21. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  22. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  23. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  24. YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

    eess.AS 2026-03 unverdicted novelty 7.0

    YingMusic-Singer-Plus is a diffusion model for singing voice synthesis that preserves melody from a reference clip while allowing flexible lyric changes without manual alignment, outperforming Vevo2 and introducing th...

  25. Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

    cs.CV 2026-03 unverdicted novelty 7.0

    SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

  26. Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

    cs.CV 2026-02 unverdicted novelty 7.0

    DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.

  27. Bird-SR: Bidirectional Reward-Guided Diffusion for Real-World Image Super-Resolution

    cs.CV 2026-02 unverdicted novelty 7.0

    Bird-SR outperforms prior super-resolution methods on real images by guiding diffusion trajectories with bidirectional rewards, early structure optimization on synthetic pairs, and later perceptual rewards with dynami...

  28. Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching

    cs.LG 2025-09 conditional novelty 7.0

    Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.

  29. DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    cs.LG 2025-09 unverdicted novelty 7.0

    DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...

  30. MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    cs.AI 2025-07 unverdicted novelty 7.0

    MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.

  31. B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    B-GRTO extends GRPO by reusing rollouts to optimize auxiliary segmentation decoder objectives, yielding substantial gains over plain GRPO on referring segmentation tasks.

  32. FlowErase-RL: Rethinking Concept Erasure as Reward Optimization in Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    FlowErase-RL is the first GRPO-based reward optimization framework for concept erasure in flow matching models, using a dynamic dual-path reward mechanism to suppress target concepts while preserving generative quality.

  33. GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.

  34. Latent Action Control for Reasoning-Guided Unified Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.

  35. Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    Proposes HT-GRPO with sketch-then-paint staged updates, prompt-conditioned importance ratios, and hierarchical credit assignment for dMLLMs, reporting gains on GenEval and DPG plus quality metrics.

  36. Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    Flash-GRPO introduces iso-temporal grouping and temporal gradient rectification to enable single-step GRPO training that outperforms full-trajectory methods on video diffusion alignment under low compute budgets.

  37. ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

    cs.CV 2026-05 unverdicted novelty 6.0

    ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency p...

  38. Video Models Can Reason with Verifiable Rewards

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Ma...

  39. Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    CLVR couples verified logical planning with pixel diffusion, uses proxy reinforcement learning on distilled histories, and merges weights to cut inference to 4 NFEs while outperforming open-source T2I models on comple...

  40. Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.

  41. PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...

  42. Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.

  43. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV 2026-05 unverdicted novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  44. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  45. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.

  46. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to Flow Matching models through specialized teachers, cold-start initialization, task routing, and manifold regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 o...

  47. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.

  48. Slowly Annealed Langevin Dynamics: Theory and Applications to Training-Free Guided Generation

    cs.LG 2026-05 unverdicted novelty 6.0

    Slowly Annealed Langevin Dynamics provides non-asymptotic KL-based convergence guarantees for tracking moving targets and enables training-free guided generation via a velocity-aware correction that accounts for pretr...

  49. From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

    cs.CV 2026-05 unverdicted novelty 6.0

    The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...

  50. Flow-Direct: Feedback-Efficient and Reusable Guidance for Flow Models via Non-Parametric Guidance Field

    cs.LG 2026-05 unverdicted novelty 6.0

    Flow-Direct constructs a reusable non-parametric guidance field from the log-density ratio of base and target distributions using all accumulated reward samples for feedback-efficient guidance in flow models.

  51. Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.

  52. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  53. Threshold-Guided Optimization for Visual Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

  54. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  55. Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.

  56. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 builds a CoT-based reasoning reward model (RRM) via SFT and GCPO, then applies it with GRPO to improve image editing models such as FLUX.1-kontext.

  57. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  58. DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.

  59. ViPO: Visual Preference Optimization at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.

  60. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 94 Pith papers · 28 internal anchors

  1. [1]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

  2. [2]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  3. [3]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  4. [4]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  5. [5]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  6. [6]

    T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation

    Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023

  7. [7]

    Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation

    Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, and Li Yuan. Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation. arXiv preprint arXiv:2504.02782, 2025

  8. [8]

    Textdiffuser: Diffusion models as text painters

    Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters. Advances in Neural Information Processing Systems, 36:9353–9387, 2023

  9. [9]

    Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  10. [10]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  11. [11]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  12. [13]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

  13. [14]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025

  14. [15]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Juncheng Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengchen Ma, et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025

  15. [16]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  16. [17]

    Geneval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  17. [18]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  18. [19]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023

  19. [21]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  20. [22]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  21. [23]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

  22. [24]

    Kling ai

    Kuaishou. Kling ai. https://klingai.kuaishou.com/, 2024

  23. [25]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  24. [26]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1:8, 2024

  25. [27]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  26. [28]

    Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

    Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023

  27. [29]

    Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control

    Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky TQ Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control. arXiv preprint arXiv:2409.08861, 2024

  28. [30]

    Aligning text-to-image diffusion models with reward backpropagation

    Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation. arXiv preprint arXiv:2310.03739, 2023

  29. [31]

    Directly Fine-Tuning Diffusion Models on Differentiable Rewards

    Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023

  30. [32]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024

  31. [33]

    Video diffusion alignment via reward gradients

    Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Katerina Fragkiadaki, and Deepak Pathak. Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737, 2024

  32. [35]

    Online reward-weighted fine-tuning of flow matching with wasserstein regularization

    Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, and Ge Liu. Online reward-weighted fine-tuning of flow matching with wasserstein regularization. In The Thirteenth International Conference on Learning Representations, 2025

  33. [36]

    Aligning Text-to-Image Models using Human Feedback

    Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023

  34. [37]

    RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

    Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023

  35. [38]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024

  36. [39]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  37. [40]

    Using human feedback to fine-tune diffusion models without any reward model

    Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024

  38. [41]

    Step-aware preference optimization: Aligning preference with denoising performance at each step

    Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, and Liang Zheng. Step-aware preference optimization: Aligning preference with denoising performance at each step. arXiv preprint arXiv:2406.04314, 2024

  39. [42]

    Self-play fine-tuning of diffusion models for text-to-image generation

    Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning of diffusion models for text-to-image generation. arXiv preprint arXiv:2402.10210, 2024

  40. [43]

    Videodpo: Omni-preference alignment for video diffusion generation

    Runtao Liu, Haoyu Wu, Zheng Ziqiang, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation. arXiv preprint arXiv:2412.14167, 2024

  41. [44]

    Onlinevpo: Align video diffusion model with online video-centric preference optimization

    Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, and Kai Han. Onlinevpo: Align video diffusion model with online video-centric preference optimization. arXiv preprint arXiv:2412.15159, 2024

  42. [45]

    Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

    Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, and Sherry Yang. Improving dynamic object interactions in text-to-video generation with ai feedback. arXiv preprint arXiv:2412.02617, 2024

  43. [46]

    Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization

    Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13199–13208, 2025

  44. [47]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  45. [48]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023

  46. [49]

    Reinforcement learning for fine-tuning text-to-image diffusion models

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36, 2024

  47. [50]

    A simple and effective reinforcement learning method for text-to-image diffusion fine-tuning

    Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin, Sreya Dutta Roy, Harrie Oosterhuis, Maarten de Rijke, and Satya Narayan Shukla. A simple and effective reinforcement learning method for text-to-image diffusion fine-tuning. arXiv preprint arXiv:2503.00897, 2025

  48. [51]

    Training diffusion models towards diverse image generation with reinforcement learning

    Zichen Miao, Jiang Wang, Ze Wang, Zhengyuan Yang, Lijuan Wang, Qiang Qiu, and Zicheng Liu. Training diffusion models towards diverse image generation with reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10844–10853, 2024

  49. [52]

    Score as action: Fine-tuning diffusion generative models by continuous-time reinforcement learning

    Hanyang Zhao, Haoxian Chen, Ji Zhang, David D Yao, and Wenpin Tang. Score as action: Fine-tuning diffusion generative models by continuous-time reinforcement learning. arXiv preprint arXiv:2502.01819, 2025

  50. [53]

    Training-free diffusion model alignment with sampling demons

    Po-Hung Yeh, Kuang-Huei Lee, and Jun-Cheng Chen. Training-free diffusion model alignment with sampling demons. arXiv preprint arXiv:2410.05760, 2024

  51. [54]

    Inference-time alignment of diffusion models with direct noise optimization

    Zhiwei Tang, Jiangweizhi Peng, Jiasheng Tang, Mingyi Hong, Fan Wang, and Tsung-Hui Chang. Tuning-free alignment of diffusion models with direct noise optimization. arXiv preprint arXiv:2405.18881, 2024

  52. [55]

    Loss-guided diffusion models for plug-and-play controllable generation

    Jiaming Song, Qinsheng Zhang, Hongxu Yin, Morteza Mardani, Ming-Yu Liu, Jan Kautz, Yongxin Chen, and Arash Vahdat. Loss-guided diffusion models for plug-and-play controllable generation. In International Conference on Machine Learning, pages 32483–32498. PMLR, 2023

  53. [56]

    F5r-tts: Improving flow matching based text-to-speech with group relative policy optimization

    Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, and Baoxun Wang. F5r-tts: Improving flow matching based text-to-speech with group relative policy optimization. arXiv preprint arXiv:2504.02407, 2025

  54. [57]

    Inference-time scaling for flow models via stochastic generation and rollover budget forcing

    Jaihoon Kim, Taehoon Yoon, Jisung Hwang, and Minhyuk Sung. Inference-time scaling for flow models via stochastic generation and rollover budget forcing. arXiv preprint arXiv:2503.19385, 2025

  55. [58]

    Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

    Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model. arXiv preprint arXiv:2503.07703, 2025

  56. [59]

    Laion aesthetics

    Christoph Schuhmann. Laion aesthetics, Aug 2022

  57. [60]

    Teaching large language models to regress accurate image quality scores using score distribution

    Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. arXiv preprint arXiv:2501.11561, 2025

  58. [61]

    Unified Reward Model for Multimodal Understanding and Generation

    Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025

  59. [62]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  60. [63]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  61. [64]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022

  62. [65]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023

  63. [66]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  64. [67]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  65. [68]

    Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Liang Zhao, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975, 2024

  66. [69]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  67. [70]

    Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer

    Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025

  68. [71]

    ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

    Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025

  69. [72]

    Acereason-nemotron: Advancing math and code reasoning through reinforcement learning

    Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron: Advancing math and code reasoning through reinforcement learning. arXiv preprint arXiv:2505.16400, 2025

  70. [73]

    T2i-compbench++: An enhanced and comprehensive benchmark for compositional text-to-image generation

    Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench++: An enhanced and comprehensive benchmark for compositional text-to-image generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  71. [74]

    Stochastic differential equations

    Bernt Øksendal. Stochastic differential equations. Springer, 2003

  72. [75]

    Reverse-time diffusion equation models

    Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982

  73. [76]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019