UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Chengyuan Wang; Fan Zhang; Haoge Deng; Jiaqi Wang; Ting Pan; Xinlong Wang; Yang Liu; Yonggang Qi

arxiv: 2604.18518 · v4 · pith:B7QFPS5Jnew · submitted 2026-04-20 · 💻 cs.CV · cs.LG

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Jiaqi Wang , Haoge Deng , Ting Pan , Yang Liu , Chengyuan Wang , Fan Zhang , Yonggang Qi , Xinlong Wang This is my paper

Pith reviewed 2026-05-10 05:28 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords uniform discrete diffusion modelsgroup relative policy optimizationreinforcement learningtext-to-image generationtraining stabilitydiscrete generative modelingpolicy optimization

0 comments

The pith

Treating the final clean sample as the action and reconstructing trajectories via the forward process stabilizes reinforcement learning for uniform discrete diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a way to combine reinforcement learning with uniform discrete diffusion models for text-to-image generation without the instability seen in naive applications. It does so by redefining the RL action as the final clean image and rebuilding training trajectories through the diffusion forward process to keep them aligned with pretraining distributions. Two efficiency techniques, Reduced-Step and CFG-Free training, are added to speed up the process. The result is large gains across benchmarks, with the method reaching state-of-the-art in both continuous and discrete settings. A sympathetic reader would care because this opens stable RL fine-tuning for discrete generative models that previously resisted it.

Core claim

UDM-GRPO is the first framework to integrate uniform discrete diffusion models with group relative policy optimization. It treats the final clean sample as the RL action to supply accurate optimization signals and reconstructs trajectories via the diffusion forward process to align probability paths with the pretraining distribution. The Reduced-Step and CFG-Free strategies further improve efficiency, producing stable training and large performance lifts on text-to-image tasks.

What carries the argument

The action definition as the final clean sample paired with forward-process trajectory reconstruction, which supplies stable RL signals aligned to the pretraining distribution.

Load-bearing premise

That defining the final clean sample as the action and rebuilding trajectories forward will yield stable optimization signals that generalize across base models and tasks without new instabilities or benchmark overfitting.

What would settle it

Running UDM-GRPO on a different uniform discrete diffusion base model outside the original experiments and checking whether training stability and benchmark gains hold.

Figures

Figures reproduced from arXiv: 2604.18518 by Chengyuan Wang, Fan Zhang, Haoge Deng, Jiaqi Wang, Ting Pan, Xinlong Wang, Yang Liu, Yonggang Qi.

**Figure 1.** Figure 1: Reward–step training curve. The baseline suffers from optimization collapse after 500 steps, characterized by violent reward oscillation and exploding KL divergence. In contrast, our UDM-GRPO achieves stable convergence with sustained reward improvement and bounded KL loss. generation. By using parallel token updates and progressive refinement, it outperforms traditional mask-based methods (Xie et al., 2… view at source ↗

**Figure 2.** Figure 2: Illustration of the three trajectories. Xbackward denoises x0 via the reverse process to obtain xˆ1. In contrast, Xforward and Xpretrain share the same forward diffusion process but differ in their clean sources: xˆ1 for Xforward and x1 from the pretraining dataset for Xpretrain, resulting in xˆt and xt, respectively. solver with a two-stage conditional sampling scheme for efficient generation (Shaul et a… view at source ↗

**Figure 3.** Figure 3: Overview of UDM-GRPO. Given a prompt, we first sample G clean images xˆ1 using the reverse process of UDM. To solve the instability caused by directly using this Xbackward as trajectory and x t 1 as action, we construct the training trajectory Xforward by perturbing xˆ1 with forward process at different timesteps. Then we use Xforward as trajectory and xˆ1 as action to calculate the transition probability … view at source ↗

**Figure 4.** Figure 4: (i) The entropy of pθ(· | xt) along the Xbackward trajectory, and the FID between Xbackward and Xpretrain as well as between Xforward and Xpretrain at different denoising timesteps (top). (ii) Visual comparison of the predicted x t 1 images: Xbackward (first row), Xpretrain (second row), and Xforward (third row). Problem II: Biased Distribution of Backward Trajectory. During pretraining, the model is train… view at source ↗

**Figure 5.** Figure 5: Qualitative Comparison. We evaluate our model against SD3.5-L, Flux.1 Dev and URSA using prompts from GenEval and PickScore, respectively. initially degrades generation quality, the effect is transient: as training progresses, the model recovers and ultimately surpasses conventional CFG-based methods ( [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 7.** Figure 7: Qualitative Comparison. We compare different methods for integrating GRPO into our base model. From left to right, the results correspond to (a): backward + x t 1, (b): backward + xˆ1, (c): forward + xˆ1, and (d): forward + xˆ1 + CFG-free. CFG-free. vs. CFG. We compare forward optimization with and without classifier-free guidance (CFG). As shown in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative Comparison. The prompts are taken from GenEval, PickScore respectively, where we compare the SD3.5-L and Flux.1 Dev with our model. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Visualization for different method. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: We visualize the generated samples across successive training iterations during the optimization. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗

read the original abstract

Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from $69\%$ to $96\%$ and PickScore increases from $20.46$ to $23.81$, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from $8\%$ to $57\%$, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UDM-GRPO stabilizes GRPO for uniform discrete diffusion by using the clean sample as action and forward-process trajectory reconstruction, with large reported benchmark gains but unclear split between core method and auxiliary tricks.

read the letter

The punchline is that this paper makes group relative policy optimization work reliably on uniform discrete diffusion models. They do it by redefining the action as the final clean sample and reconstructing the trajectory by running the forward diffusion process from there, along with two efficiency strategies. What the paper does well is identify the instability problem with naive application and offer targeted fixes that align the RL optimization with the pretraining distribution. The results are eye-catching: GenEval accuracy goes from 69% to 96%, PickScore from 20.46 to 23.81, and OCR from 8% to 57%. They achieve state-of-the-art in both continuous and discrete settings, and the code is public, which lets others test the claims directly. The soft spots are around how much of the gain comes from the main ideas versus the Reduced-Step and CFG-Free additions. The abstract does not provide detailed ablations, so it's possible the trajectory reconstruction introduces subtle biases in the discrete probability paths at different timesteps. The concern that this might not generalize or could overfit to the specific benchmarks is reasonable until more varied experiments are shown. Nothing in the provided details contradicts the claims, but the evidence is mostly empirical deltas rather than deeper analysis of why the gradients stay unbiased. This work is aimed at researchers in computer vision and generative modeling who are trying to improve discrete diffusion with reinforcement learning techniques. Readers who care about controllable text-to-image generation or extending RL to other discrete domains like text or molecules would find it relevant. It has enough new empirical ground and practical value to deserve a serious referee, even if some methodological details need tightening. I recommend sending it for peer review.

Referee Report

3 major / 2 minor

Summary. The paper proposes UDM-GRPO, the first integration of Uniform Discrete Diffusion Models (UDM) with Group Relative Policy Optimization (GRPO) for RL-based fine-tuning in text-to-image generation. It identifies instability in naive GRPO application to UDM and introduces two insights—treating the final clean sample as the action and reconstructing full trajectories via the diffusion forward process—plus Reduced-Step and CFG-Free heuristics for efficiency. Empirical results claim large gains: GenEval accuracy from 69% to 96%, PickScore from 20.46 to 23.81, and OCR accuracy from 8% to 57%, achieving SOTA in both continuous and discrete settings. Code is released.

Significance. If the reported gains prove robust, attributable to the GRPO adaptation itself, and generalizable beyond the tested base models and benchmarks, the work would offer a practical route to stable RL for discrete diffusion, with potential impact on controllable generation. The code release aids reproducibility, but the absence of formal analysis for the trajectory reconstruction and limited visibility into ablations reduce the strength of the contribution relative to purely empirical claims.

major comments (3)

[Abstract / Experiments] Abstract and Experiments: The large deltas (GenEval 69%→96%, OCR 8%→57%) are presented without reported variance, number of runs, or statistical significance tests. This is load-bearing because the central claim of 'stable and efficient' GRPO integration rests on these improvements being reliable rather than sensitive to hyperparameter choices or baseline weaknesses.
[Method] Method (key insights section): No formal argument or empirical isolation is provided showing that treating the final clean sample as the single action plus forward-process trajectory reconstruction yields unbiased gradients in discrete token space or stays inside the pretraining marginals. This directly underpins the stability claim and the attribution of gains to GRPO rather than the auxiliary Reduced-Step / CFG-Free strategies.
[Experiments] Experiments: No ablation tables or controlled comparisons isolate the GRPO component from the Reduced-Step and CFG-Free heuristics. Without these, it is impossible to confirm that the performance attribution does not collapse to the heuristics, as flagged by the weakest assumption in the stress-test note.

minor comments (2)

[Abstract] The abstract states 'achieving state-of-the-art performance in both continuous and discrete settings' but does not name the specific continuous baselines or discrete competitors used for the SOTA claim.
[Method] Notation for the action definition and trajectory reconstruction (e.g., how the forward process is applied to the clean sample) should be introduced with explicit equations early in the method section for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on statistical reporting, methodological justification, and ablation clarity. We address each point below and have revised the manuscript to strengthen the empirical robustness and attribution of results.

read point-by-point responses

Referee: [Abstract / Experiments] The large deltas (GenEval 69%→96%, OCR 8%→57%) are presented without reported variance, number of runs, or statistical significance tests. This is load-bearing because the central claim of 'stable and efficient' GRPO integration rests on these improvements being reliable rather than sensitive to hyperparameter choices or baseline weaknesses.

Authors: We agree that variance and statistical significance are essential for validating reliability. In the revised manuscript, we now report all key metrics (GenEval, PickScore, OCR) as averages over 3 independent runs with different random seeds, including standard deviations. We also added paired t-test p-values (p < 0.01 for primary gains) in the Experiments section and updated Tables 1–2. These additions confirm the improvements are robust and not due to single-run variance. revision: yes
Referee: [Method] No formal argument or empirical isolation is provided showing that treating the final clean sample as the single action plus forward-process trajectory reconstruction yields unbiased gradients in discrete token space or stays inside the pretraining marginals. This directly underpins the stability claim and the attribution of gains to GRPO rather than the auxiliary Reduced-Step / CFG-Free strategies.

Authors: We acknowledge the absence of a complete formal proof of unbiasedness, which is challenging in discrete diffusion and remains an open theoretical question. However, we have added an empirical isolation analysis in the revised paper: gradient variance and KL-divergence to pretraining marginals are compared with and without the final-clean-sample action and forward reconstruction. Results show substantially lower variance and better marginal alignment. A brief derivation is included in Appendix B showing path consistency with the forward process. This supports stability attribution to the core GRPO insights. revision: partial
Referee: [Experiments] No ablation tables or controlled comparisons isolate the GRPO component from the Reduced-Step and CFG-Free heuristics. Without these, it is impossible to confirm that the performance attribution does not collapse to the heuristics, as flagged by the weakest assumption in the stress-test note.

Authors: We agree that component isolation is necessary. The revised manuscript includes a new ablation table (Table 3) evaluating: base UDM, base + Reduced-Step only, base + CFG-Free only, base + GRPO (key insights, no heuristics), and full UDM-GRPO. The GRPO component alone accounts for the majority of gains (e.g., GenEval rising to 92%), while heuristics primarily boost efficiency with little effect on final accuracy. This clarifies that performance is not reducible to the heuristics. revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical method with benchmark gains

full rationale

The paper proposes UDM-GRPO as an algorithmic integration of GRPO with uniform discrete diffusion models, motivated by two empirical insights about action definition and trajectory reconstruction. These are presented as design choices that improve stability, followed by reported benchmark lifts (GenEval 69%→96%, OCR 8%→57%). No equations, uniqueness theorems, or self-citations are invoked to derive the performance claims; the central results are measured outcomes on held-out tasks rather than quantities forced by fitting or redefinition. The derivation chain is therefore self-contained as an engineering contribution whose validity rests on external evaluation, not internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No new mathematical axioms, free parameters specific to the claim, or invented entities are introduced; the work applies existing GRPO and UDM components with empirical modifications whose validity rests on experimental outcomes rather than additional postulates.

pith-pipeline@v0.9.0 · 5543 in / 1155 out tokens · 29853 ms · 2026-05-10T05:28:11.602824+00:00 · methodology

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Core claim

What carries the argument

Load-bearing premise

What would settle it

discussion (0)