{"paper":{"title":"UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Treating the final clean sample as the action and reconstructing trajectories via the forward process stabilizes reinforcement learning for uniform discrete diffusion models.","cross_cats":["cs.LG"],"primary_cat":"cs.CV","authors_text":"Chengyuan Wang, Fan Zhang, Haoge Deng, Jiaqi Wang, Ting Pan, Xinlong Wang, Yang Liu, Yonggang Qi","submitted_at":"2026-04-20T17:16:50Z","abstract_excerpt":"Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. \nOn the OCR benchmark, accuracy rises from 8% to 57%.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that treating the final clean sample as the action and reconstructing trajectories via the forward process will generalize beyond the specific base models and tasks tested, without introducing new instabilities or overfitting to the chosen benchmarks.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy from 8% to 57%.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Treating the final clean sample as the action and reconstructing trajectories via the forward process stabilizes reinforcement learning for uniform discrete diffusion models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8300815d552be2b8079a6320c8a64b7c8c9e3e9cb0b674af843939e4f7133238"},"source":{"id":"2604.18518","kind":"arxiv","version":3},"verdict":{"id":"6cd3e333-897e-429e-8682-a7708a097025","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-10T05:28:11.602824Z","strongest_claim":"UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. \nOn the OCR benchmark, accuracy rises from 8% to 57%.","one_line_summary":"UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy from 8% to 57%.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that treating the final clean sample as the action and reconstructing trajectories via the forward process will generalize beyond the specific base models and tasks tested, without introducing new instabilities or overfitting to the chosen benchmarks.","pith_extraction_headline":"Treating the final clean sample as the action and reconstructing trajectories via the forward process stabilizes reinforcement learning for uniform discrete diffusion models."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2604.18518/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"doi_compliance","ran_at":"2026-05-20T03:55:35.160579Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"050310d1dadabd5ada7bf10bfb96fc10cb2b2fe18f7edde7036b81b4dec15d2b"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}