pith. machine review for the scientific record.

arxiv: 2507.21802 · v6 · submitted 2025-07-29 · 💻 cs.AI · cs.CV

Recognition: 3 theorem links

· Lean Theorem

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Junzhe Li , Yutao Cui , Tao Huang , Yinping Ma , Chun Fan , Yiming Cheng , Miles Yang , Zhao Zhong , Liefeng Bo


Pith reviewed 2026-05-13 13:25 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords MixGRPO · GRPO · flow matching · human preference alignment · ODE-SDE mixing · sliding window · training efficiency · image generation

The pith

MixGRPO improves GRPO efficiency for flow matching image models by restricting SDE sampling and optimization to a sliding window while using ODE sampling outside it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MixGRPO to address the inefficiency of GRPO in aligning flow matching models with human preferences for image generation. Previous approaches like FlowGRPO and DanceGRPO require sampling and optimizing over all denoising steps dictated by the Markov Decision Process, which creates high overhead. MixGRPO mixes strategies by applying SDE sampling and GRPO-guided optimization only inside a sliding window of time-steps and deterministic ODE sampling elsewhere. This confines randomness and gradient updates to fewer steps, reduces optimization overhead, accelerates convergence, and supports higher-order solvers outside the window for even faster sampling in the MixGRPO-Flash variant.

Core claim

By integrating SDE sampling and GRPO-guided optimization within a sliding window and ODE sampling outside it, MixGRPO streamlines the MDP optimization in flow matching models. This confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead and allowing for more focused gradient updates to accelerate convergence. Time-steps beyond the sliding window support higher-order solvers for faster sampling, yielding the MixGRPO-Flash variant that further improves training efficiency while achieving comparable performance.

What carries the argument

The sliding window mechanism that applies SDE sampling and GRPO optimization only within selected denoising steps and ODE sampling outside the window to confine randomness and focus updates.
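
The windowed sampling idea can be illustrated with a toy sketch. This is not the paper's sampler: `toy_velocity`, `mixed_sample`, `noise_scale`, and all parameter values are hypothetical, and the drift correction that a marginal-preserving SDE would need is omitted. It only shows where randomness enters the trajectory and which steps would be candidates for GRPO updates.

```python
import numpy as np


def toy_velocity(x, t):
    # Hypothetical stand-in for a learned velocity field v_theta(x, t).
    return -x


def mixed_sample(x0, num_steps, window_start, window_size, noise_scale=0.1, seed=0):
    """Euler integration with noise injected only inside the window.

    Returns the final state and the indices of the stochastic steps,
    i.e. the only steps an optimizer would need to visit.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / num_steps
    x = np.asarray(x0, dtype=float).copy()
    stochastic_steps = []
    for i in range(num_steps):
        t = i * dt
        # Deterministic Euler (ODE) update, applied at every step.
        x = x + toy_velocity(x, t) * dt
        if window_start <= i < window_start + window_size:
            # Inside the window: add an Euler-Maruyama style Gaussian kick.
            x = x + noise_scale * np.sqrt(dt) * rng.standard_normal(x.shape)
            stochastic_steps.append(i)
    return x, stochastic_steps


x, steps = mixed_sample(np.ones(4), num_steps=25, window_start=0, window_size=4)
print(steps)  # [0, 1, 2, 3]
```

With `window_size=0` the trajectory is fully deterministic, which is the regime where a higher-order ODE solver could replace the Euler update.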

If this is right

  • Higher-order ODE solvers can be applied outside the window for faster sampling without affecting optimization quality.
  • Training time drops by nearly 50 percent compared to DanceGRPO while delivering stronger human preference alignment across multiple dimensions.
  • The MixGRPO-Flash variant achieves comparable results with 71 percent lower training time.
  • Focused gradient updates within the window accelerate convergence of the alignment process.
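
To put the reported reductions on a common scale, a fractional time reduction r corresponds to a speedup of 1 / (1 - r). The helper below is a hypothetical convenience, and it treats "nearly 50 percent" as exactly 0.50 and assumes both figures are measured against the same baseline, which the abstract does not state explicitly.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional training-time reduction into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


print(round(speedup_from_reduction(0.50), 2))  # 2.0
print(round(speedup_from_reduction(0.71), 2))  # 3.45
```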

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The mixed ODE-SDE window strategy may transfer to other reinforcement-learning alignment methods that currently optimize over full denoising trajectories.
  • Dynamically resizing the window during training could further balance speed and final alignment quality.
  • The same restriction of stochastic steps might reduce memory or compute costs in video or 3D generative models that use many denoising iterations.
  • Testing the approach on non-image flow models would reveal whether the efficiency gain is specific to image denoising schedules.

Load-bearing premise

That restricting SDE sampling and GRPO optimization to a sliding window preserves full MDP optimization quality and does not introduce bias or slower convergence outside the window.

What would settle it

A controlled experiment that measures alignment scores and convergence speed when the sliding window is progressively shrunk versus kept at full width on identical base models and datasets.
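
One way such a sweep could be organized is sketched below. The halving schedule and both function names are assumptions for illustration; `run_experiment` is a caller-supplied callable that would train with a given window width and return whatever metrics are being compared, and no results are implied here.

```python
def halving_widths(full_width):
    """Window widths from the full width down to 1, halving each time."""
    widths = []
    w = full_width
    while w >= 1:
        widths.append(w)
        w //= 2
    return widths


def sweep(widths, run_experiment):
    """Call run_experiment(width) for each width and collect the results."""
    return {w: run_experiment(w) for w in widths}


print(halving_widths(25))  # [25, 12, 6, 3, 1]
```

Keeping the base model, dataset, and seed fixed inside `run_experiment` is what would make the widths comparable.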

Original abstract

Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO and DanceGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose $\textbf{MixGRPO}$, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead, and allowing for more focused gradient updates to accelerate convergence. Additionally, as time-steps beyond the sliding window are not involved in optimization, higher-order solvers are supported for faster sampling. So we present a faster variant, termed $\textbf{MixGRPO-Flash}$, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract states no explicit free parameters, axioms, or invented entities beyond the standard MDP formulation of GRPO and the known ODE/SDE properties of flow matching; all details remain implicit.

pith-pipeline@v0.9.0 · 5569 in / 1086 out tokens · 50916 ms · 2026-05-13T13:25:55.850740+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.LedgerForcing conservation_from_balance · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the entire denoising process to be framed as a Markov Decision Process (MDP) in a stochastic environment, where GRPO is then applied to optimize the complete state-action sequence

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

    cs.CV 2026-04 unverdicted novelty 8.0

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  2. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  3. OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.

  4. TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.

  5. TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    TMPO replaces scalar reward maximization with trajectory-level matching to a Boltzmann distribution via Softmax-TB, improving generative diversity by 9.1% while keeping competitive reward performance.

  6. Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline

    cs.AI 2026-05 unverdicted novelty 7.0

    A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.

  7. ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

    cs.LG 2026-04 unverdicted novelty 7.0

    ParetoSlider conditions diffusion models on continuous preference weights to approximate the full Pareto front, providing dynamic control over multi-objective rewards at inference time.

  8. Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.

  9. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  10. UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...

  11. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  12. YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

    eess.AS 2026-03 unverdicted novelty 7.0

    YingMusic-Singer-Plus is a diffusion model for singing voice synthesis that preserves melody from a reference clip while allowing flexible lyric changes without manual alignment, outperforming Vevo2 and introducing th...

  13. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV 2026-05 unverdicted novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  14. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  15. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.

  16. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.

  17. From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

    cs.CV 2026-05 unverdicted novelty 6.0

    The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...

  18. POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    POCA combines Pareto optimization with curriculum alignment to improve multi-reward reinforcement learning for visual text generation without relying on weighted sums.

  19. V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

    cs.LG 2026-04 unverdicted novelty 6.0

    V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.

  20. Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target.

  21. Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    RC-GRPO-Editing constrains GRPO exploration to editing regions via localized noise and attention rewards, improving instruction adherence and non-target preservation in flow-based image editing.

  22. MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.

  23. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  24. CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

    cs.LG 2026-03 unverdicted novelty 6.0

    CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...

  25. HunyuanVideo 1.5 Technical Report

    cs.CV 2025-11 unverdicted novelty 6.0

    HunyuanVideo 1.5 delivers state-of-the-art open-source text-to-video and image-to-video generation with an 8.3B parameter DiT model featuring SSTA attention, glyph-aware encoding, and progressive training.

  26. Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 5.0

    Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.

  27. A Systematic Post-Train Framework for Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

  28. Reward-Aware Trajectory Shaping for Few-step Visual Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 26 Pith papers · 16 internal anchors

  1. [1]

    Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

    Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797,

  2. [2]

    Discount factor as a regularizer in reinforcement learning

    Ron Amit, Ron Meir, and Kamil Ciosek. Discount factor as a regularizer in reinforcement learning. In International conference on machine learning, pages 269–278. PMLR, 2020.

  3. [3]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.

  4. [4]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning,

  5. [5]

    Optimizing ddpm sampling with shortcut fine-tuning

    Ying Fan and Kangwook Lee. Optimizing ddpm sampling with shortcut fine-tuning. arXiv preprint arXiv:2301.13362,

  6. [6]

    Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36:79858–79885, 2023.

  7. [7]

    Diffusion meets flow matching: Two sides of the same coin

    Ruiqi Gao, Emiel Hoogeboom, Jonathan Heek, Valentin De Bortoli, Kevin P. Murphy, and Tim Salimans. Diffusion meets flow matching: Two sides of the same coin. 2024.

  8. [8]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

  9. [9]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  10. [10]

    On the role of discount factor in offline reinforcement learning

    Hao Hu, Yiqin Yang, Qianchuan Zhao, and Chongjie Zhang. On the role of discount factor in offline reinforcement learning. In International conference on machine learning, pages 9072–9098. PMLR, 2022.

  11. [11]

    Muon: An optimizer for hidden layers in neural networks

    Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024.

  12. [12]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022.

  13. [13]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023.

  14. [14]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

  15. [15]

    Aligning Text-to-Image Models using Human Feedback

    Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.

  16. [16]

    Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization

    Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13199–13208, 2025.

  17. [17]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  18. [18]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025.

  19. [19]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025.

  20. [20]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  21. [21]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  22. [22]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787,

  23. [23]

    Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.

  24. [24]

    Hpsv3: Towards wide-spectrum human preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score,

  25. [25]

    Reward hacking behavior can generalize across tasks—ai alignment forum

    Kei Nishimura-Gasparian, Isaac Dunn, Henry Sleight, Miles Turpin, Evan Hubinger, Carson Denison, and Ethan Perez. Reward hacking behavior can generalize across tasks—ai alignment forum. InAI Alignment Forum, 2024. 8

  26. [26]

    Stochastic differential equations

    Bernt Øksendal. Stochastic differential equations. In Stochastic differential equations: an introduction with applications, pages 38–50. Springer, 2003.

  27. [27]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.

  28. [28]

    Rethinking the discount factor in reinforcement learning: A decision theoretic approach

    Silviu Pitis. Rethinking the discount factor in reinforcement learning: A decision theoretic approach. In Proceedings of the AAAI conference on artificial intelligence, pages 7949–7956, 2019.

  29. [29]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  30. [30]

    Fokker-planck equation

    Hannes Risken. Fokker-planck equation. Springer, 1996.

  31. [31]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

  32. [32]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.

  33. [33]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  34. [34]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  35. [35]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

  36. [36]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  37. [37]

    Hunyuanvideo 1.5 technical report

    Tencent Hunyuan Foundation Model Team. Hunyuanvideo 1.5 technical report, 2025.

  38. [38]

    Delving into rl for image generation with cot: A study on dpo vs. grpo

    Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. Delving into rl for image generation with cot: A study on dpo vs. grpo. arXiv preprint arXiv:2505.17017, 2025.

  39. [39]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024.

  40. [40]

    Coefficients-preserving sampling for reinforcement learning with flow matching

    Feng Wang and Zihao Yu. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952, 2025.

  41. [41]

    Unified Reward Model for Multimodal Understanding and Generation

    Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025.

  42. [42]

    Reward hacking in reinforcement learning

    Lilian Weng. Reward hacking in reinforcement learning. lilianweng.github.io, 2024.

  43. [43]

    RewardDance: Reward scaling in visual generation

    Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. Rewarddance: Reward scaling in visual generation. arXiv preprint arXiv:2509.08826, 2025.

  44. [44]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341,

  45. [45]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.

  46. [46]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025.

  47. [47]

    Discussion on flow-grpo issue 7

    GitHub User yifan123. Discussion on flow-grpo issue 7. https://github.com/yifan123/flow_grpo/issues/#issuecomment-2870678379, 2025. Accessed: 2025-05-12.

  48. [48]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024.

  49. [49]

    Self-play fine-tuning of diffusion models for text-to-image generation

    Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning of diffusion models for text-to-image generation. Advances in Neural Information Processing Systems, 37:73366–73398, 2024.

  50. [50]

    Unipc: A unified predictor-corrector framework for fast sampling of diffusion models

    Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. Advances in Neural Information Processing Systems, 36:49842–49869, 2023.

  51. [51]

    Dpm-solver-v3: Improved diffusion ode solver with empirical model statistics

    Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Dpm-solver-v3: Improved diffusion ode solver with empirical model statistics. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

  52. [52]

    Drivinggen: A comprehensive benchmark for generative video world models in autonomous driving

    Yang Zhou, Hao Shao, Letian Wang, Zhuofan Zong, Hongsheng Li, and Steven L Waslander. Drivinggen: A comprehensive benchmark for generative video world models in autonomous driving. arXiv preprint arXiv:2601.01528, 2026.

  53. [53]

    Proof of Convergence for Mixed ODE-SDE Sampling

    Proof of Convergence for Mixed ODE-SDE Sampling. To prove that the mixed ODE-SDE sampling method in Eq. (7) has the same convergence as Eq. (2), which uses only ODE sampling, referencing [36], we approach this from the perspective of distribution evolution, where the distribution at each time step, e.g., $\partial q_t(x) / \partial t$, must be the same. Let the interval for SD...

  54. [54]

    DPM-Solver++ for Rectified Flow

    DPM-Solver++ for Rectified Flow. For clarity and to avoid ambiguity between continuous time and discrete steps, we adopt the following notation in this section. We denote the discrete time steps by an index $i \in \{0, 1, \ldots, T-1\}$, where $T$ is the total number of sampling steps. The continuous time corresponding to step $i$ is denoted by $t_i = i/T \in [0, 1)$. The DPM...

  55. [55]

    MixGRPO-Flash Algorithm. MixGRPO-Flash (Algorithm 2) accelerates the ODE sampling that does not contribute to the calculation of the policy ratio after the sliding window by using DPM-Solver++ in Eq. (21). We introduce a compression rate $\tilde{r}$ such that the ODE sampling after the window only requires $(T - l - w)\tilde{r}$ time steps. And the total time-steps is $\tilde{T} = l + w$...

  56. [56]

    Hybrid Inference for Solving Reward Hacking. As discussed in Section 5, reward hacking stems from the limited evaluation capabilities of the reward model. To address reward hacking and improve visualization, we employ the hybrid inference strategy from [47], which uses the post-trained model for low-SNR (signal-to-noise ratio) steps and the original mod...

  57. [57]

    Cross-Dataset Experiments

    Cross-Dataset Experiments. To investigate the robustness and parameter sensitivity of the sliding window strategy in MixGRPO, we conducted a series of cross-dataset ablation studies. We established two reciprocal settings to evaluate both in-domain (ID) and out-of-domain (OOD) performance. In cross-dataset experiment 1, the model was trained on the HPD...

  58. [58]

    Coefficients-Preserving Sampling. In our MixGRPO framework, introducing stochasticity during the inference phase is crucial for effective exploration in reinforcement learning. While a common practice involves the use of Stochastic Differential Equations (SDEs), we adopt Coefficients-Preserving Sampling (CPS) [40] as a more refined alternative to maint...

  59. [59]

    More Visualized Results

    More Visualized Results. [Figure 7 panels: FLUX, DanceGRPO, MixGRPO. Prompts: "An image of an aircraft carrier made of cheese."; "16-year-old teenager wearing a white bear-ear hat with a smirk on their face."; "A lemon with a McDonald's hat."] Figure 7. Comparison of the visualization results of FLUX, DanceGRPO, and MixGRPO...
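
The MixGRPO-Flash excerpt in entry [55] states that ODE sampling after the window needs only (T − l − w) multiplied by the compression rate time steps. A small worked example of that count is sketched below. It reads l as the number of steps before the window and w as the window size, which the truncated excerpt does not confirm; rounding up with `ceil` is also an assumption, and the numeric values are hypothetical rather than the paper's settings.

```python
import math


def post_window_steps(total_steps, window_start, window_size, compression_rate):
    """Number of ODE steps after the window under a compression rate.

    Implements (T - l - w) * r from the excerpt; rounding up is an assumption.
    """
    remaining = total_steps - window_start - window_size
    if remaining < 0:
        raise ValueError("window extends past the last step")
    return math.ceil(remaining * compression_rate)


# Hypothetical values: T = 25, l = 0, w = 4, compression rate 0.5.
print(post_window_steps(25, 0, 4, 0.5))  # 11
```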