pith. machine review for the scientific record. sign in

arxiv: 2505.05470 · v5 · submitted 2025-05-08 · 💻 cs.CV · cs.AI

Recognition: 3 theorem links

· Lean Theorem

Flow-GRPO: Training Flow Matching Models via Online RL

Authors on Pith no claims yet

Pith reviewed 2026-05-11 18:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords flow matchingreinforcement learningpolicy gradienttext-to-image generationODE to SDEdenoising reductiongenerative models
0
0 comments X

The pith

Flow matching models can be trained with online policy gradient reinforcement learning by converting their ODE to an equivalent SDE with identical marginals at every timestep and by reducing denoising steps during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to apply online reinforcement learning to flow matching models used for image generation. It converts the deterministic ODE trajectory into a stochastic differential equation whose probability distribution matches the original model exactly at each timestep, which supplies valid samples for the RL exploration process. A separate strategy shortens the number of denoising steps used while the model is being trained, yet leaves the full number of steps available at inference time. This combination produces large gains on tasks that require precise control over object counts, spatial arrangements, and embedded text. The resulting models show improved alignment with human preferences while exhibiting little reward hacking that would trade away image quality or diversity.

Core claim

Flow matching models, which learn a velocity field along deterministic ODE paths, become amenable to online policy-gradient RL once their ODE is converted to an SDE whose marginal distribution equals the flow model's distribution at every timestep; a denoising-reduction schedule then accelerates the RL updates without altering the original inference procedure.

What carries the argument

The ODE-to-SDE conversion that produces an SDE whose marginal distribution exactly matches the original flow-matching model at every timestep, thereby supplying statistically valid trajectories for RL exploration.

If this is right

  • Generative performance improves substantially on tasks requiring accurate object counts, spatial relations, and fine-grained attributes.
  • Accuracy on visual text rendering increases markedly.
  • Alignment with human preferences rises while image quality and diversity remain largely intact.
  • The same procedure applies across multiple text-to-image tasks with minimal reward hacking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ODE-to-SDE device could be applied to other deterministic continuous-time generative models to enable RL fine-tuning.
  • The training-time efficiency gain from fewer denoising steps may make the approach practical for higher-resolution or longer-horizon models.
  • Reward functions could be engineered to target specific remaining failure modes such as counting errors or text legibility without retraining from scratch.

Load-bearing premise

The ODE-to-SDE conversion produces an SDE whose marginal distribution exactly matches the original flow-matching model at every timestep.

What would settle it

Sampling states from the converted SDE at an intermediate timestep and finding that their distribution differs from the distribution of states reached by integrating the original ODE to the same timestep would show that the RL policy is being trained on invalid data.

read the original abstract

We propose Flow-GRPO, the first method to integrate online policy gradient reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original number of inference steps, significantly improving sampling efficiency without sacrificing performance. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For compositional generation, RL-tuned SD3.5-M generates nearly perfect object counts, spatial relations, and fine-grained attributes, increasing GenEval accuracy from $63\%$ to $95\%$. In visual text rendering, accuracy improves from $59\%$ to $92\%$, greatly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, very little reward hacking occurred, meaning rewards did not increase at the cost of appreciable image quality or diversity degradation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Flow-GRPO, the first method to integrate online policy-gradient RL into flow-matching models for text-to-image generation. It introduces two components: (1) an ODE-to-SDE conversion claimed to produce an SDE whose marginal distribution exactly matches the original flow model at every timestep, enabling on-manifold exploration for RL, and (2) a Denoising Reduction strategy that lowers the number of denoising steps during training while preserving inference steps. Experiments on SD3.5-M report large gains on compositional generation (GenEval accuracy 63% to 95%) and visual text rendering (59% to 92%), with substantial human-preference alignment and minimal reward hacking.

Significance. If the ODE-to-SDE conversion is shown to preserve marginals exactly, the work would provide a practical route for applying online RL to flow-based generative models, potentially improving controllability and alignment on complex tasks without severe distribution shift or reward hacking. The scale of the reported empirical gains suggests the approach could be impactful for downstream applications in compositional and text-conditioned image synthesis.

major comments (2)
  1. [§3.2] §3.2, ODE-to-SDE conversion: The manuscript asserts that the derived SDE matches the original flow-matching marginal p_t(x) at all timesteps so that SDE trajectories remain valid training data for the RL policy. No explicit SDE coefficients (drift and diffusion) or derivation verifying that the Fokker-Planck equation holds exactly for the flow ODE's velocity field are supplied; without this, the central claim that SDE samples introduce no distribution shift cannot be verified and the reported gains cannot be attributed to the proposed mechanism.
  2. [§4] §4, Experimental results: Large improvements are reported (GenEval 63%→95%, text rendering 59%→92%), yet the manuscript provides neither the precise reward functions, baseline RL implementations, number of policy updates, statistical significance tests, nor ablations that isolate the ODE-to-SDE conversion from the denoising-reduction component. This absence makes it impossible to confirm that the gains stem from the claimed technical contributions rather than implementation details or hyperparameter tuning.
minor comments (2)
  1. [Notation] The notation distinguishing the original flow ODE from the converted SDE would be clearer if both sets of equations were presented side-by-side in the main text rather than deferred to the appendix.
  2. [Figure 2] Figure 2 (or equivalent) illustrating the training pipeline would benefit from explicit labels indicating where marginal preservation is enforced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the technical contributions and improve reproducibility. We address each major comment below and will revise the manuscript accordingly to provide the requested details and derivations.

read point-by-point responses
  1. Referee: [§3.2] §3.2, ODE-to-SDE conversion: The manuscript asserts that the derived SDE matches the original flow-matching marginal p_t(x) at all timesteps so that SDE trajectories remain valid training data for the RL policy. No explicit SDE coefficients (drift and diffusion) or derivation verifying that the Fokker-Planck equation holds exactly for the flow ODE's velocity field are supplied; without this, the central claim that SDE samples introduce no distribution shift cannot be verified and the reported gains cannot be attributed to the proposed mechanism.

    Authors: We agree that the manuscript would benefit from an explicit derivation. The ODE-to-SDE conversion is constructed by adding a diffusion term whose coefficient is derived from the flow velocity field such that the Fokker-Planck equation is satisfied identically, ensuring the marginals p_t(x) remain unchanged. In the revision we will include the closed-form drift and diffusion coefficients together with the step-by-step verification that the Fokker-Planck operator applied to the flow velocity yields zero divergence from the original probability flow. revision: yes

  2. Referee: [§4] §4, Experimental results: Large improvements are reported (GenEval 63%→95%, text rendering 59%→92%), yet the manuscript provides neither the precise reward functions, baseline RL implementations, number of policy updates, statistical significance tests, nor ablations that isolate the ODE-to-SDE conversion from the denoising-reduction component. This absence makes it impossible to confirm that the gains stem from the claimed technical contributions rather than implementation details or hyperparameter tuning.

    Authors: We acknowledge the need for greater experimental transparency. The revised manuscript will report: the exact reward functions (including the GenEval and text-rendering reward formulations), the baseline RL implementations used for comparison, the total number of policy-gradient updates, standard deviations and statistical significance tests across multiple random seeds, and dedicated ablations that separately disable the ODE-to-SDE conversion and the Denoising Reduction strategy while keeping all other hyperparameters fixed. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard components without self-referential reduction

full rationale

The paper presents Flow-GRPO as an integration of online policy gradient RL with flow matching via two strategies: ODE-to-SDE conversion (asserted to preserve marginals exactly) and Denoising Reduction. These are described as novel combinations of existing techniques rather than derivations that collapse to author-defined fits or self-citations. No equations in the provided text reduce a claimed prediction or result to an input by construction (e.g., no fitted parameter renamed as output, no uniqueness theorem imported from overlapping prior work). Empirical gains are reported as measured outcomes on benchmarks, not tautological. The central assumption about marginal preservation is a technical claim open to verification but does not constitute circularity under the specified patterns. The derivation chain is self-contained against external RL and flow-matching literature.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities beyond the two named strategies; full paper would be needed to audit any implicit assumptions in the conversion or reward formulation.

pith-pipeline@v0.9.0 · 5524 in / 1076 out tokens · 34019 ms · 2026-05-11T18:39:34.536095+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 59 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

    cs.CV 2026-05 unverdicted novelty 8.0

    CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.

  2. OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

    cs.CV 2026-04 unverdicted novelty 8.0

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  3. ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

    cs.CV 2026-05 unverdicted novelty 7.0

    ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.

  4. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV 2026-05 unverdicted novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

  5. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  6. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

  7. OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.

  8. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

  9. Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

    cs.CV 2026-05 unverdicted novelty 7.0

    PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

  10. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 conditional novelty 7.0

    Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...

  11. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  12. Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline

    cs.AI 2026-05 unverdicted novelty 7.0

    A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.

  13. Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

  14. HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

  15. Generative Texture Filtering

    cs.CV 2026-04 unverdicted novelty 7.0

    A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.

  16. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  17. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  18. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  19. YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

    eess.AS 2026-03 unverdicted novelty 7.0

    YingMusic-Singer-Plus is a diffusion model for singing voice synthesis that preserves melody from a reference clip while allowing flexible lyric changes without manual alignment, outperforming Vevo2 and introducing th...

  20. DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    cs.LG 2025-09 unverdicted novelty 7.0

    DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...

  21. MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    cs.AI 2025-07 unverdicted novelty 7.0

    MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.

  22. PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...

  23. Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.

  24. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV 2026-05 unverdicted novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  25. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  26. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.

  27. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.

  28. Slowly Annealed Langevin Dynamics: Theory and Applications to Training-Free Guided Generation

    cs.LG 2026-05 unverdicted novelty 6.0

    Slowly Annealed Langevin Dynamics provides non-asymptotic KL-based convergence guarantees for tracking moving targets and enables training-free guided generation via a velocity-aware correction that accounts for pretr...

  29. From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

    cs.CV 2026-05 unverdicted novelty 6.0

    The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...

  30. Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.

  31. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  32. Threshold-Guided Optimization for Visual Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

  33. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  34. Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.

  35. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  36. DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.

  37. ViPO: Visual Preference Optimization at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.

  38. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  39. POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    POCA combines Pareto optimization with curriculum alignment to improve multi-reward reinforcement learning for visual text generation without relying on weighted sums.

  40. V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

    cs.LG 2026-04 unverdicted novelty 6.0

    V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.

  41. Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation

    cs.CV 2026-04 unverdicted novelty 6.0

    Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.

  42. Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    RC-GRPO-Editing constrains GRPO exploration to editing regions via localized noise and attention rewards, improving instruction adherence and non-target preservation in flow-based image editing.

  43. MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.

  44. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  45. CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

    cs.LG 2026-03 unverdicted novelty 6.0

    CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...

  46. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    cs.CL 2025-06 conditional novelty 6.0

    High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

  47. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

  48. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  49. Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

    cs.LG 2026-05 unverdicted novelty 5.0

    Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.

  50. Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

    cs.CV 2026-05 unverdicted novelty 5.0

    DeScore decouples explicit CoT reasoning from reward regression in video reward models via a two-stage cold-start plus dual-objective RL training pipeline.

  51. Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

    cs.LG 2026-05 unverdicted novelty 5.0

    Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.

  52. A Systematic Post-Train Framework for Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

  53. Reward-Aware Trajectory Shaping for Few-step Visual Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

  54. TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Guided Optimization

    cs.CV 2026-03 unverdicted novelty 5.0

    TIGFlow-GRPO uses a Trajectory-Interaction-Graph in conditional flow matching plus Flow-GRPO optimization to produce more accurate, socially compliant, and physically feasible trajectory forecasts on ETH/UCY and SDD datasets.

  55. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    cs.CV 2025-11 unverdicted novelty 5.0

    Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...

  56. Qwen-Image Technical Report

    cs.CV 2025-08 unverdicted novelty 5.0

    Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...

  57. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

  58. Seedance 1.0: Exploring the Boundaries of Video Generation Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.

  59. Seedream 4.0: Toward Next-generation Multimodal Image Generation

    cs.CV 2025-09 unverdicted novelty 3.0

    Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 56 Pith papers · 24 internal anchors

  1. [1]

    Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

  2. [2]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022. 10

  3. [3]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  4. [4]

    Scaling rectified flow trans- formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  5. [5]

    Flux.https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

  6. [6]

    T2i-compbench: A compre- hensive benchmark for open-world compositional text-to-image generation.Advances in Neural Information Processing Systems, 36:78723–78747, 2023

    Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A compre- hensive benchmark for open-world compositional text-to-image generation.Advances in Neural Information Processing Systems, 36:78723–78747, 2023

  7. [7]

    Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation.arXiv preprint arXiv:2504.02782,

    Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, and Li Yuan. Gpt-imgeval: A comprehensive benchmark for diagnosing gpt4o in image generation.arXiv preprint arXiv:2504.02782, 2025

  8. [8]

    Textdiffuser: Diffusion models as text painters.Advances in Neural Information Processing Systems, 36:9353– 9387, 2023

    Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters.Advances in Neural Information Processing Systems, 36:9353– 9387, 2023

  9. [9]

    MIT press Cambridge, 1998

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  10. [10]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  11. [11]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  12. [13]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

  13. [14]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025

  14. [15]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Juncheng Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengchen Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

  15. [16]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  16. [17]

    Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  17. [18]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  18. [19]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:36652–36663, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:36652–36663, 2023. 11

  19. [21]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  20. [22]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

  21. [23]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020

  22. [24]

    Kling ai.https://klingai.kuaishou.com/, 2024

    Kuaishou. Kling ai.https://klingai.kuaishou.com/, 2024

  23. [25]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  24. [26]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1:8, 2024

  25. [27]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  26. [28]

    Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

    Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.arXiv preprint arXiv:2303.08797, 2023

  27. [29]

    Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control

    Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky TQ Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control. arXiv preprint arXiv:2409.08861, 2024

  28. [30]

    Aligning text-to- image diffusion models with reward backpropagation (2023).arXiv preprint arXiv:2310.03739,

    Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text- to-image diffusion models with reward backpropagation.arXiv preprint arXiv:2310.03739, 2023

  29. [31]

    Directly fine-tuning diffusion models on differentiable rewards.arXiv preprint arXiv:2309.17400,

    Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards.arXiv preprint arXiv:2309.17400, 2023

  30. [32]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024

  31. [33]

    Prabhudesai, M., Goyal, A., Pathak, D., and Fragkiadaki, K

    Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Katerina Fragkiadaki, and Deepak Pathak. Video diffusion alignment via reward gradients.arXiv preprint arXiv:2407.08737, 2024

  32. [35]

    Online reward-weighted fine-tuning of flow matching with wasserstein regularization

    Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, and Ge Liu. Online reward-weighted fine-tuning of flow matching with wasserstein regularization. InThe Thirteenth International Conference on Learning Representations, 2025

  33. [36]

    Aligning Text-to-Image Models using Human Feedback

    Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023

  34. [37]

    arXiv preprint arXiv:2304.06767 , year=

    Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment.arXiv preprint arXiv:2304.06767, 2023. 12

  35. [38]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024

  36. [39]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  37. [40]

    Using human feedback to fine-tune diffusion models without any reward model

    Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024

  38. [41]

    Step-aware preference optimization: Aligning preference with denoising performance at each step.arXiv preprint arXiv:2406.04314, 2024

    Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, and Liang Zheng. Step-aware preference optimization: Aligning preference with denoising performance at each step.arXiv preprint arXiv:2406.04314, 2024

  39. [42]

    Self-play fine-tuning of diffusion models for text-to-image generation,

    Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning of diffusion models for text-to-image generation.arXiv preprint arXiv:2402.10210, 2024

  40. [43]

    Videodpo: Omni-preference alignment for video diffusion generation,

    Runtao Liu, Haoyu Wu, Zheng Ziqiang, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation.arXiv preprint arXiv:2412.14167, 2024

  41. [44]

    Onlinevpo: Align video diffusion model with online video-centric preference optimization,

    Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, and Kai Han. Onlinevpo: Align video diffusion model with online video-centric preference optimization. arXiv preprint arXiv:2412.15159, 2024

  42. [45]

    Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

    Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, and Sherry Yang. Improving dynamic object interactions in text-to-video generation with ai feedback.arXiv preprint arXiv:2412.02617, 2024

  43. [46]

    Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization

    Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13199–13208, 2025

  44. [47]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  45. [48]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023

  46. [49]

    Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36, 2024

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36, 2024

  47. [50]

    A simple and effective reinforcement learning method for text-to-image diffusion fine-tuning.arXiv preprint arXiv:2503.00897, 2025

    Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin, Sreya Dutta Roy, Harrie Oosterhuis, Maarten de Rijke, and Satya Narayan Shukla. A simple and effective reinforcement learning method for text-to-image diffusion fine-tuning.arXiv preprint arXiv:2503.00897, 2025

  48. [51]

    Training diffusion models towards diverse image generation with reinforcement learning

    Zichen Miao, Jiang Wang, Ze Wang, Zhengyuan Yang, Lijuan Wang, Qiang Qiu, and Zicheng Liu. Training diffusion models towards diverse image generation with reinforcement learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10844–10853, 2024

  49. [52]

    Score as action: Fine-tuning diffusion generative models by continuous-time reinforcement learning.arXiv preprint arXiv:2502.01819, 2025

    Hanyang Zhao, Haoxian Chen, Ji Zhang, David D Yao, and Wenpin Tang. Score as action: Fine-tuning diffusion generative models by continuous-time reinforcement learning.arXiv preprint arXiv:2502.01819, 2025

  50. [53]

    Training-free diffusion model alignment with sampling demons.arXiv preprint arXiv:2410.05760, 2024

    Po-Hung Yeh, Kuang-Huei Lee, and Jun-Cheng Chen. Training-free diffusion model alignment with sampling demons.arXiv preprint arXiv:2410.05760, 2024. 13

  51. [54]

    Tuning-free alignment of diffusion models with direct noise optimization.arXiv preprint arXiv:2405.18881, 2024

    Zhiwei Tang, Jiangweizhi Peng, Jiasheng Tang, Mingyi Hong, Fan Wang, and Tsung-Hui Chang. Tuning-free alignment of diffusion models with direct noise optimization.arXiv preprint arXiv:2405.18881, 2024

  52. [55]

    Loss-guided diffusion models for plug-and-play controllable generation

    Jiaming Song, Qinsheng Zhang, Hongxu Yin, Morteza Mardani, Ming-Yu Liu, Jan Kautz, Yongxin Chen, and Arash Vahdat. Loss-guided diffusion models for plug-and-play controllable generation. InInternational Conference on Machine Learning, pages 32483–32498. PMLR, 2023

  53. [56]

    F5r-tts: Improving flow matching based text-to-speech with group relative policy optimization.arXiv preprint arXiv:2504.02407, 2025

    Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, and Baoxun Wang. F5r-tts: Improving flow matching based text-to-speech with group relative policy optimization.arXiv preprint arXiv:2504.02407, 2025

  54. [57]

    Inference-time scaling for flow models via stochastic generation and rollover budget forcing.arXiv preprint arXiv:2503.19385, 2025

    Jaihoon Kim, Taehoon Yoon, Jisung Hwang, and Minhyuk Sung. Inference-time scaling for flow models via stochastic generation and rollover budget forcing.arXiv preprint arXiv:2503.19385, 2025

  55. [58]

    Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025

    Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025

  56. [59]

    Laion aesthetics, Aug 2022

    Chrisoph Schuhmann. Laion aesthetics, Aug 2022

  57. [60]

    Teaching large language models to regress accurate image quality scores using score distribution.arXiv preprint arXiv:2501.11561, 2025

    Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution.arXiv preprint arXiv:2501.11561, 2025

  58. [61]

    Unified Reward Model for Multimodal Understanding and Generation

    Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation.arXiv preprint arXiv:2503.05236, 2025

  59. [62]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  60. [63]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

  61. [64]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3, 2022

  62. [65]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023

  63. [66]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single trans- former to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

  64. [67]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  65. [68]

    Janusflow: Harmonizing autoregres- sion and rectified flow for unified multimodal understanding and generation.arXiv preprint arXiv:2411.07975, 2024

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Liang Zhao, et al. Janusflow: Harmonizing autoregres- sion and rectified flow for unified multimodal understanding and generation.arXiv preprint arXiv:2411.07975, 2024

  66. [69]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025. 14

  67. [70]

    Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer

    Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer.arXiv preprint arXiv:2501.18427, 2025

  68. [71]

    Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models.arXiv preprint arXiv:2505.24864, 2025b

    Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025

  69. [72]

    AceReason-Nemotron: Advancing math and code reasoning through reinforcement learning.arXiv preprint arXiv:2505.16400,

    Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron: Advancing math and code reasoning through reinforcement learning.arXiv preprint arXiv:2505.16400, 2025

  70. [73]

    T2i- compbench++: An enhanced and comprehensive benchmark for compositional text-to-image generation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i- compbench++: An enhanced and comprehensive benchmark for compositional text-to-image generation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  71. [74]

    Springer, 2003

    Bernt Øksendal and Bernt Øksendal.Stochastic differential equations. Springer, 2003

  72. [75]

    Reverse-time diffusion equation models.Stochastic Processes and their Applications, 12(3):313–326, 1982

    Brian DO Anderson. Reverse-time diffusion equation models.Stochastic Processes and their Applications, 12(3):313–326, 1982

  73. [76]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177, 2019. 15 Appendix of Flow-GRPO: Training Flow Matching Models via Online RL A Mathematical Derivations for Stochastic Sampling using Flow Models 17 B Further Details on the Experi...