pith. machine review for the scientific record.

arxiv: 2605.08063 · v3 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

Flow-OPD: On-Policy Distillation for Flow Matching Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:49 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords flow matching · on-policy distillation · text-to-image alignment · generative models · reinforcement learning · multi-task optimization · model distillation

The pith

Flow-OPD trains domain-specialized teachers with single-reward GRPO, then distills them into one flow-matching student using on-policy sampling and dense supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve the seesaw effect in multi-task alignment of flow matching text-to-image models, where scalar rewards cause sparsity and joint optimization creates gradient interference. It does so by first letting each teacher reach its ceiling in isolation through single-reward GRPO, then consolidating their expertise into a student via on-policy sampling, task-routing labeling, and trajectory-level supervision. Manifold Anchor Regularization from a task-agnostic teacher is added to keep generations on a high-quality manifold and prevent aesthetic drop. On Stable Diffusion 3.5 Medium this yields large gains on GenEval and OCR while holding fidelity and human preference scores steady and producing an emergent teacher-surpassing effect.
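
The stage-1 recipe is easiest to see in the group-relative advantage at the heart of GRPO, which needs no learned critic. A minimal sketch in PyTorch; the function name, shapes, and the illustrative OCR rewards are our assumptions, not the paper's code:

import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages in the style of GRPO.

    `rewards` has shape (num_prompts, group_size): one scalar reward per
    image in a group generated for the same prompt. Each advantage is the
    reward's z-score within its own group, so no value baseline is needed.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Stage 1 uses exactly one reward per teacher, e.g. an OCR-accuracy
# scorer for the text-rendering specialist (values illustrative):
adv = grpo_advantages(torch.tensor([[0.20, 0.90, 0.55, 0.40]]))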

Core claim

Flow-OPD first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning so each expert reaches its performance ceiling in isolation; it then applies a flow-based cold-start policy and a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision to consolidate heterogeneous expertise into one student, while Manifold Anchor Regularization supplies task-agnostic full-data supervision that anchors output to a high-quality manifold.

What carries the argument

The three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision, together with Manifold Anchor Regularization (MAR) that uses a task-agnostic teacher for full-data supervision.
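
To fix ideas, here is a hedged sketch of what the consolidation objective could look like for a flow-matching student: dense velocity matching against the routed teacher at every sampled trajectory step, plus a MAR term toward the task-agnostic anchor teacher. The loss form, the names, and the `mar_weight` coefficient are our reconstruction, not the paper's equations.

import torch

def consolidation_loss(student, teachers, anchor, x_t, t, cond, task_ids,
                       mar_weight: float = 0.1) -> torch.Tensor:
    """Hedged reconstruction of the stage-2 objective.

    x_t, t, cond: noisy latents, timesteps, and prompt embeddings gathered
    along trajectories sampled on-policy from the student; task_ids:
    routing labels selecting the responsible specialist teacher.
    """
    v_student = student(x_t, t, cond)
    with torch.no_grad():
        # Dense trajectory-level supervision: the routed specialist
        # teacher provides the target velocity at every sampled step.
        v_teacher = torch.cat([
            teachers[int(k)](x[None], ti[None], c[None])
            for k, x, ti, c in zip(task_ids, x_t, t, cond)
        ])
        # Manifold Anchor Regularization: a task-agnostic teacher
        # supervises the full batch to keep outputs on its manifold.
        v_anchor = anchor(x_t, t, cond)
    distill = (v_student - v_teacher).pow(2).mean()
    mar = (v_student - v_anchor).pow(2).mean()
    return distill + mar_weight * mar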

If this is right

  • GenEval score rises from 63 to 92 and OCR accuracy from 59 to 94 on Stable Diffusion 3.5 Medium.
  • Overall improvement of roughly 10 points over vanilla GRPO while image fidelity and human-preference alignment remain intact.
  • An emergent teacher-surpassing effect appears where the single student outperforms its own teachers on some metrics.
  • The framework scales as a post-training paradigm for building generalist text-to-image models without proportional growth in interference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage teacher-then-distill pattern could be tested on diffusion or autoregressive image models that currently rely on joint RL.
  • If the teacher-surpassing effect holds, it suggests distillation can produce capabilities that exceed any single isolated teacher rather than merely averaging them.
  • The cold-start scheme and manifold regularization may generalize to other flow-based or ODE-based generative models facing sparse multi-objective rewards.

Load-bearing premise

Single-reward GRPO fine-tuning lets each domain-specialized teacher reach its performance ceiling in isolation, and the subsequent on-policy orchestration can consolidate their expertise without reintroducing gradient interference or reward hacking.

What would settle it

A head-to-head run in which the Flow-OPD student scores below the average of its individual teachers on the same domain-specific metrics or exhibits renewed reward hacking and metric trade-offs under multi-task evaluation.
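
In code, the refutation criterion reduces to a per-metric comparison; the metric names and numbers below are illustrative, not reported values.

def teacher_surpassing(student: dict, teachers: dict) -> dict:
    """True per metric iff the student meets or beats the average of
    its individual teachers' scores on that same metric."""
    return {m: student[m] >= sum(v) / len(v) for m, v in teachers.items()}

# Illustrative only: two teachers evaluated on both domain metrics.
checks = teacher_surpassing(
    {"geneval": 92, "ocr": 94},
    {"geneval": [90, 70], "ocr": [62, 91]},
)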

Figures

Figures reproduced from arXiv: 2605.08063 by Feng Zhao, Kaituo Feng, Lin Chen, Shaosheng Cao, Shuang Chen, Wenxuan Huang, Yiming Zhao, Yunlong Lin, Yu Zeng, Zehui Chen, Zhen Fang.

Figure 1: Performance Comparison in Multi-task Training. During training, Flow-OPD exhibits a steady increase in mean rewards across the GenEval [21] and OCR [22] benchmarks, reaching a peak of 93; in contrast, vanilla GRPO converges prematurely around 78. Flow-OPD significantly outperforms GRPO in both image synthesis and text rendering while maintaining superior generation quality and human preference alignment. …
Figure 2: Cross-task evaluation of single-reward GRPO. Optimizing with a solitary reward signal …
Figure 3: Qualitative comparison between Flow-OPD and various baselines across diverse tasks. Our …
Figure 4: Cold-start ablation results. Qualitative results in …
Figure 5: Qualitative ablation results of Manifold Anchor Regularization.
Figure 6: We use Qwen3-30B-A3B-Instruct-2507. More qualitative results are shown in …
Figure 6: The structured evaluation prompt for the QwenVL Score.
Figure 7: More quantitative comparisons on the PickScore evaluation set.
Figure 8: More quantitative comparisons on the GenEval evaluation set.
Figure 9: More quantitative comparisons on the OCR evaluation set.
Figure 10: More quantitative comparisons with DiffusionNFT [49].
Figure 11: More quantitative comparisons with DiffusionNFT [49].
read the original abstract

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models. The codes and weights will be released in: https://github.com/CostaliyA/Flow-OPD .
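
Two standard objects underlie the machinery the abstract names; both predate this paper, and the notation below follows the wider literature rather than Flow-OPD's own equations. Flow matching (in the common linear-path convention of rectified flow) trains a velocity field, and GRPO scores each of G samples by its group-relative advantage:

x_t = (1 - t)\, x_0 + t\, x_1, \qquad
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\; x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\mathrm{data}}} \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2,

A_i = \frac{r_i - \operatorname{mean}(r_{1:G})}{\operatorname{std}(r_{1:G})}, \qquad i = 1, \dots, G.

Single-reward GRPO in stage 1 simply instantiates r with one task's reward model at a time.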

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Flow-OPD, a two-stage on-policy distillation framework for aligning Flow Matching text-to-image models. It first trains domain-specialized teacher models via single-reward GRPO fine-tuning to reach isolated performance ceilings, then uses a Flow-based Cold-Start to initialize a student policy and consolidates expertise through on-policy sampling, task-routing labeling, and dense trajectory-level supervision, augmented by Manifold Anchor Regularization (MAR) to anchor generations to a high-quality manifold. Built on Stable Diffusion 3.5 Medium, the method reports lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 (roughly 10 points above vanilla GRPO) while preserving fidelity and human-preference scores and exhibiting an emergent teacher-surpassing effect.

Significance. If the empirical gains are shown to arise specifically from the proposed consolidation without interference or hacking, Flow-OPD would provide a scalable post-training paradigm for multi-task alignment of flow-based generative models, addressing the seesaw effect that has limited prior RL approaches. The code and weight release supports reproducibility and could influence alignment techniques beyond text-to-image.

major comments (3)
  1. [Abstract and §4 (Experiments)] The central claim of an emergent 'teacher-surpassing' effect (abstract and experiments) rests on the assumption that each single-reward GRPO teacher reaches its performance ceiling in isolation, yet no GenEval or OCR metrics, reward curves, or ceiling comparisons are reported for the individual domain-specialized teachers. Without these, it cannot be determined whether the student gains (GenEval 92, OCR 94) derive from the three-step orchestration or from cold-start, MAR, or extended training.
  2. [§4 (Ablations and Analysis)] No ablation studies isolate the contribution of the three-step consolidation (on-policy sampling, task-routing labeling, dense trajectory-level supervision) versus the Flow-based Cold-Start or MAR; the ~10-point lift over vanilla GRPO is therefore difficult to attribute specifically to the proposed orchestration, leaving the no-gradient-interference assumption load-bearing but untested.
  3. [§4 (Evaluation)] The reported metric lifts lack statistical significance tests, variance across runs, exact baseline implementation details, and evaluation protocol specifications (e.g., prompt sets, sampling steps), all of which are required to substantiate the central empirical claims, which currently rest on only moderate support.
minor comments (2)
  1. [§3 (Method)] The mathematical formulation of Manifold Anchor Regularization (MAR) would benefit from an explicit equation or pseudocode in the methods section to clarify how the task-agnostic teacher provides full-data supervision; one plausible form is sketched after this list.
  2. [Figures and Tables] Figure captions and tables should explicitly state the number of evaluation prompts and seeds used for GenEval and OCR to improve clarity and reproducibility.
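
As a concrete stake in the ground for the first minor comment, here is a plausible reconstruction of MAR, assuming it is a velocity-matching penalty toward a frozen task-agnostic teacher v_φ taken over the full prompt distribution; this is an editorial reading, not the paper's own equation:

\mathcal{L}_{\mathrm{MAR}}(\theta) = \mathbb{E}_{c \sim p_{\mathrm{prompts}},\; t,\; x_t} \big\| v_\theta(x_t, t, c) - v_\phi(x_t, t, c) \big\|^2,

where v_θ is the student's velocity field and the expectation runs over all training prompts rather than a routed subset, which is what "full-data supervision" suggests.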

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions to strengthen the empirical claims.

read point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central claim of an emergent 'teacher-surpassing' effect (abstract and experiments) rests on the assumption that each single-reward GRPO teacher reaches its performance ceiling in isolation, yet no GenEval or OCR metrics, reward curves, or ceiling comparisons are reported for the individual domain-specialized teachers. Without these, it cannot be determined whether the student gains (GenEval 92, OCR 94) derive from the three-step orchestration or from cold-start, MAR, or extended training.

    Authors: We acknowledge that the individual teacher performance metrics were omitted from the original submission. In the revised manuscript we will add GenEval and OCR scores for each domain-specialized teacher, together with their reward curves over the course of single-reward GRPO fine-tuning. These additions will confirm that the teachers reached their isolated ceilings and thereby clarify that the observed student gains arise from the consolidation stage rather than from cold-start, MAR, or longer training alone. revision: yes

  2. Referee: [§4 (Ablations and Analysis)] No ablation studies isolate the contribution of the three-step consolidation (on-policy sampling, task-routing labeling, dense trajectory-level supervision) versus the Flow-based Cold-Start or MAR; the ~10-point lift over vanilla GRPO is therefore difficult to attribute specifically to the proposed orchestration, leaving the no-gradient-interference assumption load-bearing but untested.

    Authors: We agree that finer-grained ablations are required to attribute gains specifically to the three-step orchestration. In the revision we will introduce new ablation experiments that successively remove on-policy sampling, task-routing labeling, and dense trajectory-level supervision while holding the cold-start initialization and MAR fixed. We note that the components are intentionally interdependent, so perfect isolation is not always possible without altering the core framework; nevertheless, the added results will provide the strongest available evidence for the contribution of the consolidation procedure. revision: partial

  3. Referee: [§4 (Evaluation)] The reported metric lifts lack statistical significance tests, variance across runs, exact baseline implementation details, and evaluation protocol specifications (e.g., prompt sets, sampling steps), all of which are required to substantiate the central empirical claims, which currently rest on only moderate support.

    Authors: We will strengthen the evaluation section by adding statistical significance tests (paired t-tests and bootstrap confidence intervals), reporting standard deviations across at least three independent runs with different random seeds, and supplying precise implementation details for all baselines, the exact prompt sets used for GenEval and OCR, and the sampling hyperparameters (steps, guidance scale, etc.). These specifications will appear in the main text and an expanded supplementary material. revision: yes
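
The promised paired analysis has a standard shape; a minimal sketch, assuming per-prompt scores for both systems on the same prompt set (the function name and defaults are illustrative, not the authors' protocol):

import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot: int = 10_000,
                        alpha: float = 0.05, seed: int = 0):
    """Bootstrap confidence interval for the mean paired difference
    between two systems' per-prompt scores (e.g., Flow-OPD vs. vanilla
    GRPO on the same GenEval prompts)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    boots = np.array([
        rng.choice(diffs, size=diffs.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    # The lift is significant at level alpha if 0 lies outside (lo, hi).
    return diffs.mean(), (lo, hi)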

Circularity Check

0 steps flagged

No significant circularity; claims are empirical

full rationale

The paper proposes Flow-OPD as a two-stage empirical training procedure (domain-specialized teachers via single-reward GRPO, followed by cold-start and three-step on-policy distillation with MAR). All load-bearing claims are performance numbers (GenEval from 63 to 92, OCR from 59 to 94, ~10-point gain over vanilla GRPO) obtained from experiments on Stable Diffusion 3.5 Medium. No mathematical derivation, first-principles prediction, or equation chain is presented that reduces to fitted parameters or self-defined quantities by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the text. The 'teacher-surpassing' effect and 'performance ceiling' statements are empirical assertions whose verification would require additional teacher metrics, but this is a gap in evidence rather than circularity in a derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the effectiveness of single-reward GRPO for creating ceiling-reaching experts and on the ability of on-policy distillation plus MAR to consolidate them without new interference; these are treated as domain assumptions rather than derived results.

axioms (2)
  • domain assumption: Single-reward GRPO fine-tuning allows each expert to reach its performance ceiling in isolation.
    Invoked in the first stage of the two-stage alignment strategy described in the abstract.
  • domain assumption: On-policy distillation with task-routing and dense trajectory supervision can transfer heterogeneous expertise without reintroducing gradient interference.
    Central to the second stage of the proposed framework.
invented entities (1)
  • Manifold Anchor Regularization (MAR): no independent evidence.
    purpose: Provide task-agnostic full-data supervision to anchor generation to a high-quality manifold and mitigate aesthetic degradation.
    Newly introduced regularization term whose independent evidence is limited to the reported empirical improvements.

pith-pipeline@v0.9.0 · 5653 in / 1580 out tokens · 49944 ms · 2026-05-15T05:49:13.205127+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 12 internal anchors

  1. [1] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, 2025.
  2. [2] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
  3. [3] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
  4. [4] Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, and Feng Zhao. DualVLA: Building a generalizable embodied agent via partial decoupling of reasoning and action. arXiv preprint arXiv:2511.22134, 2025.
  5. [5] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models, 2026.
  6. [6] Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, et al. Vision-DeepResearch: Incentivizing deepresearch capability in multimodal large language models. arXiv preprint arXiv:2601.22060, 2026.
  7. [7] Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning. arXiv preprint arXiv:2506.04207, 2025.
  8. [8] Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, and Nanyun Peng. Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping. arXiv preprint arXiv:2510.08457, 2025.
  9. [10] Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai, Quanxin Shou, Yunlong Lin, Xiangyu Yue, Shenghua Gao, and Tianyu Pang. OpenSearch-VL: An open recipe for frontier multimodal search agents, 2026.
  10. [11] Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, et al. Unicorn: Towards self-improving unified multimodal models through self-generated supervision. arXiv preprint arXiv:2601.03193, 2026.
  11. [12] Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, et al. Unify-Agent: A unified multimodal agent for world-grounded image synthesis. arXiv preprint arXiv:2603.29620, 2026.
  12. [13] Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, and Xiangyu Yue. Gen-Searcher: Reinforcing agentic search for image generation. arXiv preprint arXiv:2603.28767, 2026.
  13. [14] Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, et al. Interleaving reasoning for better text-to-image generation. arXiv preprint arXiv:2509.06945, 2025.
  14. [15] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
  15. [17] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, and Ping Luo. DanceGRPO: Unleashing GRPO on visual generation, 2025.
  16. [18] Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Yiming Cheng, Miles Yang, Zhao Zhong, and Liefeng Bo. MixGRPO: Unlocking flow-based GRPO efficiency with mixed ODE-SDE. arXiv preprint arXiv:2507.21802, 2025.
  17. [19] Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.
  18. [20] Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780, 2026.
  19. [21] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023.
  20. [22] Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. TextDiffuser: Diffusion models as text painters. Advances in Neural Information Processing Systems, 36:9353–9387, 2023.
  21. [23] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning, 2024.
  22. [24] Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models, 2023.
  23. [25] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 15903–15935, 2023.
  24. [26] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization, 2023.
  25. [27] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025.
  26. [28] Shihao Yuan, Yahui Liu, Yang Yue, Jingyuan Zhang, Wangmeng Zuo, Qi Wang, Fuzheng Zhang, and Guorui Zhou. AR-GRPO: Training autoregressive image generation models via reinforcement learning. arXiv preprint arXiv:2508.06924, 2025.
  27. [29] Guohui Zhang, Hu Yu, Xiaoxiao Ma, JingHao Zhang, Yaning Pan, Mingde Yao, Jie Xiao, Linjiang Huang, and Feng Zhao. Group critical-token policy optimization for autoregressive image generation. arXiv preprint arXiv:2509.22485, 2025.
  28. [30] Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang, Lin Ma, and Feng Zhao. Stage: Stable and generalizable GRPO for autoregressive image generation. arXiv preprint arXiv:2509.25027, 2025.
  29. [31] Guohui Zhang, Hu Yu, Xiaoxiao Ma, Yaning Pan, Hang Xu, and Feng Zhao. MaskFocus: Focusing policy optimization on critical steps for masked image generation. arXiv preprint arXiv:2512.18766, 2025.
  30. [32] Xiaoxiao Ma, Jiachen Lei, Tianfei Ren, Jie Huang, Siming Fu, Aiming Hao, Jiahong Wu, Xiangxiang Chu, and Feng Zhao. MAR-GRPO: Stabilized GRPO for AR-diffusion hybrid image generation. arXiv preprint arXiv:2604.06966, 2026.
  31. [33] Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, and Pavlo Molchanov. GDPO: Group reward-decoupled normalization policy optimization for multi-reward RL optimization, 2026.
  32. [34] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024.
  33. [35] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024.
  34. [36] Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, and Se-Young Yun. DistiLLM-2: A contrastive approach boosts the distillation of LLMs. arXiv preprint arXiv:2503.07067, 2025.
  35. [37] Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125, 2026.
  36. [38] Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026.
  37. [39] Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D. Lyng, Sanjit Singh Batra, and Robert E. Tillman. Fast and effective on-policy distillation from reasoning prefixes. arXiv preprint arXiv:2602.15260, 2026.
  38. [40] Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. Paced: Distillation and self-distillation at the frontier of student competence. arXiv e-prints, 2026.
  39. [41] Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation
  40. [42] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023.
  41. [43] Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. arXiv preprint arXiv:2501.11561, 2025.
  42. [44] Christoph Schuhmann. LAION Aesthetics, August 2022.
  43. [45] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
  44. [46] Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025.
  45. [47] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
  46. [48] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
  47. [49] Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. DiffusionNFT: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117, 2025.
  48. [50] Appendix evaluation rubric, Aesthetic Quality: 1-2 (Low): blurry, poor lighting, or chaotic composition; 3 (Fair): in focus, adequate lighting, but lacks creativity; 4-5 (High): sharp, vibrant colors, masterful composition and impact.
  49. [51] Appendix evaluation rubric, Instruction Following: 1-2 (Low): ignores or contradicts the instruction, misses key elements; 3 (Fair): partially follows, but distorts some important elements; 4-5 (High): faithful representation of all elements in the prompt.
  50. [52] Appendix evaluation rubric, Overall Score (priority: alignment over aesthetics): the overall score must primarily reflect Instruction Following; a fair image that perfectly follows the prompt scores higher than a beautiful image that misses it. Execution rules: be rigorous, required details must be explicitly supported; analyze keyword-by-keyword …