CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
Fine-tuning discrete diffusion models with policy gradient methods
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6roles
background 1polarities
background 1representative citing papers
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
DoMinO reformulates discrete flow matching sampling as an MDP for unbiased RL fine-tuning with new TV regularizers, yielding better enhancer activity and naturalness on DNA design tasks.
SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
DiSPO optimizes intermediate decisions in masked diffusion LMs by branching at selected masked states, resampling tokens, scoring completions, and updating only new tokens using a derived policy-gradient estimator that reuses terminal rollouts.