arxiv: 2605.16842 · v1 · pith:BYFFEMDJnew · submitted 2026-05-16 · 💻 cs.AI

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

Pith reviewed 2026-05-19 21:14 UTC · model grok-4.3

classification 💻 cs.AI
keywords: reinforcement learning, diffusion models, multimodal large language models, hierarchical optimization, image generation, policy optimization, credit assignment, importance sampling

The pith

Hierarchical reinforcement learning with staged sketch-then-paint updates improves how diffusion multimodal models assign credit during image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets two problems in applying reinforcement learning to diffusion-based multimodal large language models. First, the same final image can arise from many different token unmasking orders, which makes standard importance ratios difficult to compute. Second, uniform reward signals ignore the fact that early tokens set the global layout while later ones only add details. The paper introduces HT-GRPO, which structures policy updates into three explicit stages (global layout, structural refinement, and final detail painting), uses a prompt-conditioned estimator that begins its calculation from a fully masked state, and applies a credit-assignment rule that weights structural tokens more heavily. If these changes work, the resulting images should align better with text prompts and score higher on automated quality, aesthetics, and human-preference metrics, without requiring a new model architecture.

Core claim

The paper folds the natural hierarchy of diffusion generation into the RL loop in three ways: a Sketch-Then-Paint scheme of global, structure, and refinement stages; a prompt-conditioned estimator that computes importance ratios from a fully masked start; and a Hierarchical Credit Assignment mechanism that prioritizes layout tokens. The claim is that this propagates rewards more accurately, producing measurable gains on the GenEval and DPG benchmarks, and on six additional metrics of image quality, aesthetics, and human preference, when tested on the MMaDA and Lumina-DiMOO backbones.

What carries the argument

HT-GRPO organizes policy optimization into a three-stage Sketch-Then-Paint schedule, applies Hierarchical Credit Assignment to weight structural tokens, and computes importance ratios via a prompt-conditioned estimator that starts from a fully masked state.
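
The staged schedule can be pictured with a minimal Python sketch. The 2:4:2 budget is taken from the Figure 4 caption. Everything else is an illustrative assumption rather than the authors' implementation: the names `build_schedule` and `stage_targets` are hypothetical, and so is the idea that the stages differ only in which token positions they update.

```python
from typing import List, Set, Tuple

STAGES: Tuple[str, str, str] = ("global", "structure", "refinement")


def build_schedule(budget: Tuple[int, int, int] = (2, 4, 2)) -> List[str]:
    """Expand a per-stage update budget into an ordered list of inner-loop steps."""
    return [name for name, n in zip(STAGES, budget) for _ in range(n)]


def stage_targets(stage: str, structural: Set[int], num_tokens: int) -> Set[int]:
    """Token positions a stage updates.

    ASSUMPTION: stages differ only by token subset (all tokens, structural
    tokens, remaining tokens). The source does not spell this out.
    """
    everything = set(range(num_tokens))
    if stage == "global":
        return everything
    if stage == "structure":
        return everything & structural
    if stage == "refinement":
        return everything - structural
    raise ValueError(f"unknown stage: {stage}")
```

Under these assumptions, `build_schedule((8, 0, 0))` gives a single-stage variant with the same total number of updates, which is one way to express a collapsed-schedule ablation.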

If this is right

  • Rewards no longer treat every token equally; early layout decisions receive higher effective weight during updates.
  • Importance sampling becomes tractable even when many unmasking paths lead to the same image.
  • Generated images improve on prompt-following benchmarks and on separate measures of aesthetics and human preference.
  • The same training pipeline works across different diffusion MLLM backbones without architecture changes.
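
The first point above can be made concrete with a short sketch of how an image-level advantage might be reweighted per token. The partition by generation-order rank and the structure ratio α = 0.3 follow the Figure 3 and Figure 4 captions. The two-level weights `w_struct` and `w_refine`, their values, and all function names are assumptions for illustration; the source does not give the paper's actual credit formula.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """GRPO-style advantage: each reward standardized within its sample group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def token_credits(gen_rank: Sequence[int], alpha: float = 0.3,
                  w_struct: float = 1.5, w_refine: float = 1.0) -> List[float]:
    """Per-token credit from generation order.

    gen_rank[i] is the unmasking rank of token position i (0 = unmasked first).
    The earliest alpha fraction of tokens is treated as structural.
    w_struct and w_refine are hypothetical values, not taken from the paper.
    """
    n_struct = max(1, round(alpha * len(gen_rank)))
    return [w_struct if rank < n_struct else w_refine for rank in gen_rank]


def hierarchical_advantage(adv_g: float, credits: Sequence[float]) -> List[float]:
    """Scale one image-level advantage by each token's credit."""
    return [adv_g * w for w in credits]
```

With these placeholder weights, a structural token in a sample with advantage 1.0 receives an effective advantage of 1.5, while a refinement token keeps 1.0.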

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The staged credit scheme may reduce variance in long-horizon RL for any sequential generation task where early choices constrain later ones.
  • Similar hierarchy-aware reward shaping could be tested in non-diffusion autoregressive models that also build outputs in coarse-to-fine order.
  • If the prompt-conditioned estimator proves stable, it might replace more expensive Monte-Carlo rollouts for ratio estimation in other masked generative settings.

Load-bearing premise

The prompt-conditioned estimator correctly computes importance ratios from a fully masked state and the hierarchical credit assignment accurately prioritizes structural tokens without introducing new biases or instabilities in the policy optimization.
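
To show what this premise refers to, here is a minimal sketch of per-token importance ratios computed at a single fixed input, followed by PPO/GRPO-style clipping. It assumes the per-token log-probabilities of the sampled image tokens have already been obtained from one forward pass of each policy on the prompt plus a fully masked image. That reading of the prompt-conditioned estimator, the function names, and the clip range `eps` are assumptions, not details confirmed by the source.

```python
import math
from typing import List, Sequence


def token_log_ratios(logp_new: Sequence[float],
                     logp_old: Sequence[float]) -> List[float]:
    """Per-token log importance ratios, with both policies scored on the same
    (prompt, fully masked image) input."""
    if len(logp_new) != len(logp_old):
        raise ValueError("log-prob sequences must have equal length")
    return [n - o for n, o in zip(logp_new, logp_old)]


def clipped_ratios(log_ratios: Sequence[float], eps: float = 0.2) -> List[float]:
    """Exponentiate and clip each ratio to [1 - eps, 1 + eps]."""
    return [min(max(math.exp(r), 1.0 - eps), 1.0 + eps) for r in log_ratios]
```

Because both policies are evaluated at the same fully masked input, the ratio does not depend on which unmasking order produced the sample. Whether that makes it an unbiased stand-in for the true ratio is exactly what the premise assumes.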

What would settle it

Retrain the same models with the three-stage schedule collapsed into a single stage, or with the credit-assignment weights removed. If these ablated variants match or exceed the full HT-GRPO version on GenEval and DPG, the hierarchy and credit mechanisms are not driving the reported gains.

Figures

Figures reproduced from arXiv: 2605.16842 by Guangtao Zhai, Haoxing Chen, Huayu Zheng, Jianghan Shen, Junjun He, Siqi Luo, Xiaohong Liu, Yan Tai, Yihao Liu, Yi Xin, Yue Li, Yuewen Cao.

Figure 1: Comparison of RL inner-loop strategies for dMLLMs.
Figure 2: Visualization of the image generation process in dMLLMs. Top: early (red, structure) tokens set the layout; later (blue, refinement) tokens add detail. Bottom: the corresponding outputs.
Figure 3: HT-GRPO framework. (1) Prompt and masked image tokens are rolled out to produce G sample groups. (2) Tokens are partitioned into structural (early, high-entropy) and refinement (later, low-entropy) sets via generation-order rank. (3) The image-level advantage A_g is reweighted by per-token credits w_{g,i} to form the hierarchical advantage Ã_{g,i}. (4) Inner-loop updates are scheduled into three stages, each ra…
Figure 4: (a) Stage organization and budget allocation: the full Global→Structure→Refinement schedule with a structure-biased 2:4:2 budget consistently outperforms single-stage and two-stage variants. (b) Sensitivity to structure ratio α: performance peaks at α = 0.3 and degrades when the boundary is too narrow or too broad. (c) Component analysis: Hierarchical Credit Assignment and Prompt-Conditioned Estimator each…
Figure 5: DPG-Bench counting examples on Lumina-DiMOO. For each prompt, we compare four samples per method: MaskGRPO on the left and HT-GRPO on the right. HT-GRPO more consistently preserves the requested object count.
Figure 6: DPG-Bench scene-level structural completeness on Lumina-DiMOO. Each prompt compares MaskGRPO and HT-GRPO with four samples per method. HT-GRPO better preserves relative scale, foreground–background structure, and subject pose across complex scene descriptions.
Figure 7: GenEval qualitative comparison. From left to right: base model, MaskGRPO, and HT-GRPO. HT-GRPO improves visual fidelity, object counting, and spatial relation grounding while producing more natural scene compositions.
Original abstract

Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that RL optimization of diffusion Multi-Modal LLMs is hindered by intractable importance ratios arising from multiple unmasking sequences and by uniform reward assignment that ignores the hierarchical nature of generation (early tokens for global layout, later for details). It introduces Hierarchical Token GRPO (HT-GRPO) featuring a Sketch-Then-Paint scheme with three staged updates (global, structure, refinement), a prompt-conditioned estimator for importance ratios computed from the fully masked state, and a hierarchical credit assignment mechanism that prioritizes structural tokens. Experiments on MMaDA and Lumina-DiMOO backbones report substantial gains on GenEval and DPG plus improvements across six additional metrics for image quality, aesthetics, and human preference.

Significance. If the prompt-conditioned estimator is unbiased and the staged credit assignment accurately reflects token contributions without introducing instabilities, the approach would offer a practical way to incorporate generative hierarchy into policy-gradient methods for diffusion models. This could improve sample efficiency and output quality in dMLLMs and similar sequential generation settings; the explicit staging and credit mechanism constitute a concrete integration of domain structure into RL that is not present in standard GRPO.

major comments (2)
  1. [Methods (estimator definition)] The prompt-conditioned estimator for importance ratios (described in the methods as computing ratios starting from the fully masked image) is introduced without a derivation showing equality to the true ratio summed over all unmasking sequences or that omitted paths have measure zero under the current policy. This directly affects the unbiasedness of the policy-gradient estimator and is therefore load-bearing for the central claim of substantial benchmark gains on GenEval and DPG.
  2. [Methods (credit assignment)] The hierarchical credit assignment is stated to prioritize key structural tokens, yet no explicit formula, ablation, or analysis is supplied demonstrating that it avoids new biases or variance inflation relative to uniform reward assignment. Without this, the reported improvements cannot be confidently attributed to the hierarchy rather than to the staged update schedule alone.
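
The gap raised in the first major comment can be stated in symbols. The notation below is an illustrative assumption, not the paper's own: c is the prompt, x the final image, τ an unmasking trajectory that ends in x, and m the fully masked token sequence.

```latex
% Marginal likelihood of image x over all unmasking trajectories \tau ending in x,
% and the sequence-level importance ratio it induces
p_\theta(x \mid c) = \sum_{\tau \to x} p_\theta(\tau \mid c),
\qquad
\rho(x) = \frac{p_\theta(x \mid c)}{p_{\theta_{\mathrm{old}}}(x \mid c)}

% One possible single-pass surrogate: per-token ratio at the fully masked input m
\hat{\rho}_i = \frac{\pi_\theta(x_i \mid c, m)}{\pi_{\theta_{\mathrm{old}}}(x_i \mid c, m)}
```

In this notation, the requested derivation would relate the surrogate to ρ(x), or bound the bias introduced by replacing one with the other.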
minor comments (2)
  1. [Experiments] The abstract and experiments section should include error bars, number of runs, and baseline comparisons (e.g., standard GRPO, PPO) to allow assessment of statistical significance of the reported gains.
  2. [Methods] Notation for the prompt-conditioned estimator and the three training stages could be made more precise, with explicit pseudocode or equations showing how the stages differ from vanilla GRPO updates.
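
The reporting asked for in the first minor comment needs little machinery. The sketch below summarizes a benchmark score over independent runs as mean and sample standard deviation. The function names are hypothetical, and any numbers passed in are placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over independent training runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


def report(name: str, scores: Sequence[float]) -> str:
    """Format one table cell as 'name: mean ± std (n=runs)'."""
    mu, sd = summarize(scores)
    return f"{name}: {mu:.2f} ± {sd:.2f} (n={len(scores)})"
```

Reporting the run count next to the spread lets a reader judge whether a gap between two methods exceeds seed-to-seed variation.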

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Methods (estimator definition)] The prompt-conditioned estimator for importance ratios (described in the methods as computing ratios starting from the fully masked image) is introduced without a derivation showing equality to the true ratio summed over all unmasking sequences or that omitted paths have measure zero under the current policy. This directly affects the unbiasedness of the policy-gradient estimator and is therefore load-bearing for the central claim of substantial benchmark gains on GenEval and DPG.

    Authors: We appreciate the referee's emphasis on the need for a formal justification of the estimator. The manuscript currently describes the prompt-conditioned estimator as computing ratios from the fully masked state but does not supply the requested derivation. In the revised manuscript we will add a detailed proof in the Methods section (with supporting steps in the appendix) showing that the estimator equals the expectation of the true importance ratio over all unmasking sequences under the policy, and that sequences not originating from the fully masked state have zero measure. This addition will directly address the unbiasedness concern. revision: yes

  2. Referee: [Methods (credit assignment)] The hierarchical credit assignment is stated to prioritize key structural tokens, yet no explicit formula, ablation, or analysis is supplied demonstrating that it avoids new biases or variance inflation relative to uniform reward assignment. Without this, the reported improvements cannot be confidently attributed to the hierarchy rather than to the staged update schedule alone.

    Authors: We agree that the current presentation of the hierarchical credit assignment is insufficiently detailed. The manuscript states that the mechanism prioritizes structural tokens but omits the explicit weighting formula and supporting experiments. In the revision we will insert the precise formula for the stage-dependent credit weights, include an ablation that isolates the credit assignment from the staged update schedule, and provide a short analysis of bias and variance relative to uniform rewards. These additions will strengthen the attribution of gains to the hierarchical component. revision: yes

Circularity Check

0 steps flagged

No circularity: HT-GRPO introduces independent algorithmic components evaluated empirically

full rationale

The paper presents HT-GRPO as a novel integration of staged Sketch-Then-Paint updates and hierarchical credit assignment to handle intractable importance ratios and uniform reward issues in dMLLM RL. These elements are defined as new mechanisms rather than derived from or equivalent to prior fitted parameters, self-referential equations, or self-citations. The prompt-conditioned estimator is introduced as a practical approximation starting from the fully masked state, with no claim that it reduces to the true ratio by construction. Benchmark gains on GenEval and DPG are reported as experimental outcomes, not first-principles predictions forced by the method's own inputs. The derivation chain remains self-contained as an algorithmic proposal without load-bearing reductions to definitions or fits.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that dMLLM generation is meaningfully hierarchical and that a prompt-conditioned estimator can tractably approximate importance ratios; no explicit free parameters, new physical entities, or ad-hoc axioms are detailed in the abstract.

axioms (1)
  • domain assumption: Early tokens in dMLLM generation define global layout while later tokens focus on local details.
    Invoked to justify the three-stage sketch-then-paint scheme and hierarchical credit assignment.

pith-pipeline@v0.9.0 · 5807 in / 1334 out tokens · 55460 ms · 2026-05-19T21:14:19.995559+00:00 · methodology


