
GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, et al. · 2025 · arXiv 2510.22319

Canonical reference. 100% of citing Pith papers cite this work as background.

15 Pith papers citing it



citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

DRM: Diffusion-based Reward Model With Step-wise Guidance

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

DRM turns a pre-trained diffusion model into a step-wise reward model and uses it for dense RL training (Step-wise GRPO) and guided sampling to improve final image quality.

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

cs.CV · 2026-05-14 · conditional · novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.

Efficient Adjoint Matching for Fine-tuning Diffusion Models

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

EAM reformulates adjoint matching for diffusion fine-tuning with linear base drift to allow efficient deterministic sampling and closed-form adjoints while matching or exceeding prior performance.

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.

Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

cs.CV · 2026-02-11 · unverdicted · novelty 7.0

DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.

When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.

Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

cs.LG · 2026-02-04 · conditional · novelty 6.0

An ELBO-based likelihood estimator from the final generated sample dominates other RL design factors for diffusion models, raising GenEval from 0.24 to 0.95 in 90 GPU hours with better efficiency than prior methods.

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

cs.LG · 2026-05-22 · unverdicted · novelty 5.0

Precise is a new SDE-consistent stochastic sampler that balances exploration and stability for RL post-training of flow-matching models via a novel posterior-mean approximation.

Embedding-perturbed Exploration Preference Optimization for Flow Models

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

E²PO uses embedding-level perturbations to maintain intra-group variance and discriminative signal in RL-based preference optimization for generative flow models.

Qwen-Image-2.0 Technical Report

cs.CV · 2026-05-11 · unverdicted · novelty 4.0

Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.

Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

cs.LG · 2026-04-19

citing papers explorer

Showing 15 of 15 citing papers.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
cs.CV · 2026-04-05 · unverdicted · none · ref 38

DRM: Diffusion-based Reward Model With Step-wise Guidance
cs.CV · 2026-05-25 · unverdicted · none · ref 37

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
cs.CV · 2026-05-14 · conditional · none · ref 35

Efficient Adjoint Matching for Fine-tuning Diffusion Models
cs.LG · 2026-05-12 · unverdicted · none · ref 33 · 2 links

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
cs.LG · 2026-05-09 · unverdicted · none · ref 19 · 2 links

Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
cs.LG · 2026-04-21 · unverdicted · none · ref 51

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
cs.CV · 2026-02-11 · unverdicted · none · ref 39

When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV · 2026-05-12 · unverdicted · none · ref 61

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
cs.CV · 2026-05-11 · unverdicted · none · ref 22

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
cs.LG · 2026-04-25 · unverdicted · none · ref 40

Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design
cs.LG · 2026-02-04 · conditional · none · ref 21

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models
cs.LG · 2026-05-22 · unverdicted · none · ref 16

Embedding-perturbed Exploration Preference Optimization for Flow Models
cs.CV · 2026-05-15 · unverdicted · none · ref 75

Qwen-Image-2.0 Technical Report
cs.CV · 2026-05-11 · unverdicted · none · ref 25

Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
cs.LG · 2026-04-19 · unreviewed · ref 47
