arxiv: 2505.07818 · v4 · submitted 2025-05-12 · 💻 cs.CV

Recognition: 4 theorem links


DanceGRPO: Unleashing GRPO on Visual Generation

Fangyuan Kong, Jie Wu, Lingting Zhu, Mengzhao Chen, Ping Luo, Qiushan Guo, Weilin Huang, Wei Liu, Yu Gao, Zeyue Xue, Zhiheng Liu


Pith reviewed 2026-05-11 22:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords reinforcement learning · visual generation · diffusion models · rectified flows · policy optimization · human preference alignment · generative AI · RLHF


The pith

DanceGRPO adapts group relative policy optimization to stabilize reinforcement learning for image and video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DanceGRPO as a way to apply Group Relative Policy Optimization to fine-tune generative models so they better match human preferences. Prior reinforcement learning methods for this task tend to become unstable when the number of prompts grows large and varied. By leveraging GRPO's built-in relative comparison within groups, the approach keeps optimization steady across diffusion models and rectified flow models. Experiments show gains on benchmarks that measure aesthetics, text alignment, motion quality, and overall preference scores. If correct, this would let practitioners scale preference alignment to real-world visual tasks without the usual training collapses.

Core claim

DanceGRPO shows that GRPO can be adapted directly to visual generation, delivering consistent policy updates for both diffusion and rectified flow backbones, remaining stable on diverse real-world prompts across three tasks and four base models, and supporting optimization under five different reward signals for aesthetics, alignment, motion, and binary feedback. The method records gains of up to 181 percent on HPS-v2.1, CLIP Score, VideoAlign, and GenEval relative to earlier RL baselines.
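A headline figure such as "up to 181 percent" is a relative gain, so its size depends on the baseline score as much as on the improved score. The sketch below shows the arithmetic only. The function name and the input scores are hypothetical placeholders, not numbers from the paper.

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (improved - baseline) / baseline * 100.0


# Placeholder scores (hypothetical, not taken from the paper).
# The same absolute gain of 0.25 yields very different percentages
# depending on where the baseline starts.
gain_strong_baseline = relative_gain_percent(0.25, 0.5)
gain_weak_baseline = relative_gain_percent(0.125, 0.375)
```

Because a weaker baseline inflates the percentage for the same absolute gain, the figure is easier to interpret when the baseline score is reported next to it.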

What carries the argument

DanceGRPO, the framework that ports GRPO's group-wise relative reward comparisons into the denoising or flow-matching steps of visual generators to replace unstable per-sample policy gradients.
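The group-wise comparison can be sketched in a few lines. This is a minimal illustration of the general GRPO idea, assuming several samples are drawn for one prompt and each is scored by a reward model. The function name, the `eps` term, and the choice of population standard deviation are assumptions made for the sketch, not details confirmed by this page.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sample group to zero mean and unit scale.

    `eps` (an assumption of this sketch) guards against division by zero
    when every sample in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("group must contain at least one reward")
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four samples generated from a single prompt.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])

# Adding a constant offset to every reward leaves the advantages unchanged,
# so prompts whose rewards sit on different levels become comparable.
shifted = group_relative_advantages([10.2, 10.4, 10.6, 10.8])
```

Each sample is judged only against the other samples for the same prompt, which is why no separately learned value baseline appears in the sketch.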

If this is right

Stable updates become possible at the scale of thousands of varied prompts rather than small curated sets.
The same training loop works for both image diffusion models and video flow models without architecture-specific fixes.
A single framework can optimize for multiple independent reward signals covering aesthetics, alignment, and motion quality.
Performance improvements appear on established automatic metrics without requiring new reward model training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The group-relative formulation may reduce the need for large batch sizes or variance-reduction tricks that current visual RL methods require.
If the same stability holds, the approach could be tested on other generative modalities such as audio waveforms or 3D assets.
Practitioners might combine DanceGRPO with existing preference datasets to create more responsive generators for specific domains like medical imaging or animation.
The method opens a route to iterative, on-the-fly preference updates during deployment rather than one-time offline fine-tuning.

Load-bearing premise

That the stability features of GRPO transfer to visual generation without creating fresh failure modes when prompt sets become large and heterogeneous.

What would settle it

Apply DanceGRPO to a prompt set several times larger and more diverse than those tested and check whether the policy gradient variance stays low or rises sharply enough to prevent convergence.

Original abstract

Recent advances in generative AI have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. While Reinforcement Learning (RL) has emerged as a promising approach for fine-tuning generative models, existing methods like DDPO and DPOK face fundamental limitations, particularly their inability to maintain stable optimization when scaling to large and diverse prompt sets, severely restricting their practical utility. This paper presents DanceGRPO, a framework that addresses these limitations through an innovative adaptation of Group Relative Policy Optimization (GRPO) for visual generation tasks. Our key insight is that GRPO's inherent stability mechanisms uniquely position it to overcome the optimization challenges that plague prior RL-based approaches on visual generation. DanceGRPO establishes several significant advances: First, it demonstrates consistent and stable policy optimization across multiple modern generative paradigms, including both diffusion models and rectified flows. Second, it maintains robust performance when scaling to complex, real-world scenarios encompassing three key tasks and four foundation models. Third, it shows remarkable versatility in optimizing for diverse human preferences as captured by five distinct reward models assessing image/video aesthetics, text-image alignment, video motion quality, and binary feedback. Our comprehensive experiments reveal that DanceGRPO outperforms baseline methods by up to 181% across multiple established benchmarks, including HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DanceGRPO adapts GRPO to give stable RL fine-tuning for diffusion and rectified-flow visual generators on large prompt sets, with broad tests across models and rewards but outsized gains that need baseline scrutiny.


The main point is that DanceGRPO takes Group Relative Policy Optimization and applies it to visual generation, using group-wise advantage estimates on denoising steps or flow trajectories to keep training stable where DDPO and DPOK reportedly collapse on diverse prompts. They demonstrate this on both diffusion models and rectified flows, plus three tasks and four base models. The paper also tests five reward models covering aesthetics, text alignment, motion quality, and binary feedback, which shows the method is not tied to one preference signal. Scaling curves and ablations are included, which helps support the claim that the adaptation fixes the instability without new failure modes. The reported gains reach 181% on benchmarks like HPS-v2.1, CLIP Score, VideoAlign, and GenEval, and the results look consistent with the stated mechanism. The soft spot is the size of those improvements. Large relative gains often trace back to how baselines are implemented or tuned, so the comparisons need to be tight; otherwise the number could overstate the GRPO contribution. The abstract and stress-test note do not mention error bars or run-to-run variance, which would make the stability claims easier to judge. No internal contradictions or circular arguments appear in the full text. This paper is for people who fine-tune generative models with RL and hit scaling problems on real prompt distributions. Readers working on preference alignment for images or video will get concrete implementation details and cross-paradigm evidence. It deserves peer review because it targets a documented practical limitation with experiments that span multiple settings and include the necessary ablations to evaluate the fix.

Referee Report

0 major / 2 minor

Summary. The manuscript presents DanceGRPO, a framework adapting Group Relative Policy Optimization (GRPO) to visual generation. It applies group-wise advantage estimation to denoising steps in diffusion models and trajectories in rectified flows. The central empirical claims are stable policy optimization across diffusion and rectified-flow paradigms, robustness when scaling to large/diverse prompt sets and real-world tasks (three tasks, four foundation models), versatility across five reward models, and outperformance of baselines (DDPO, DPOK) by up to 181% on HPS-v2.1, CLIP Score, VideoAlign, and GenEval.
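For readers unfamiliar with how group advantages enter a policy update, the sketch below shows the standard PPO-style clipped surrogate that GRPO builds on. It is an assumption that DanceGRPO's loss takes this form, since this page does not reproduce the paper's objective. The function names, the `clip_eps` default, and the idea that per-step log-probabilities are available as plain floats are all hypothetical simplifications.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one sampled step (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # importance ratio pi_new / pi_old
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to push the ratio
    # far outside the clipping range in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)


def group_objective(logps_new, logps_old, advantages, clip_eps=0.2):
    """Average the clipped surrogate over the samples of one group."""
    terms = [
        clipped_surrogate(n, o, a, clip_eps)
        for n, o, a in zip(logps_new, logps_old, advantages)
    ]
    return sum(terms) / len(terms)


# Hypothetical log-probabilities and group-normalized advantages for two samples.
objective = group_objective([0.0, 1.0], [0.0, 0.0], [1.0, 1.0])
```

The sketch omits any KL regularization term and treats each sample as a single step. A diffusion or rectified-flow policy would sum or average such terms over denoising steps.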

Significance. If the reported gains and stability hold, the work offers a practical route to scaling RLHF for visual generators, addressing documented instabilities in prior methods. The provision of scaling curves, ablations across models/rewards, and consistent results across paradigms strengthens the contribution and could influence future alignment techniques in image and video synthesis.

minor comments (2)

Abstract: the 'up to 181%' improvement is stated without identifying the exact benchmark, task, model, or reward model on which the maximum is attained.
§4 (Experiments): while scaling curves and ablations are mentioned, the text would benefit from explicit statements on whether error bars or multiple random seeds were used to support claims of 'consistent and stable' performance.
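The second comment asks for run-to-run variance. A minimal sketch of what such a report could compute is given below, assuming one final benchmark score per random seed. The function name and the scores are hypothetical placeholders, not results from the paper.

```python
from statistics import fmean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of final scores across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return fmean(scores), stdev(scores)


# Placeholder final scores from three seeds (hypothetical values).
mean_score, std_score = summarize_runs([0.30, 0.32, 0.34])
```

Reporting the pair as mean ± standard deviation for each method would let a reader check whether a gap between methods exceeds the spread across seeds.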

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of DanceGRPO and the recommendation for minor revision. The summary accurately captures our contributions regarding stable policy optimization across diffusion and rectified-flow models, robustness across tasks and reward models, and empirical gains over DDPO and DPOK. We will incorporate any minor suggestions into the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity


The paper is an empirical contribution describing an adaptation of Group Relative Policy Optimization (GRPO) to diffusion and rectified-flow models for visual generation. It supplies implementation details (group-wise advantage estimation on denoising steps), scaling experiments across four base models and five reward models, and benchmark comparisons. No derivation chain, first-principles prediction, or fitted parameter is presented that reduces to its own inputs by construction. Claims rest on reported experimental outcomes rather than self-referential equations or load-bearing self-citations that would create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract, the central claim rests on the unproven transfer of GRPO stability properties to visual generation and on empirical results whose details are not provided.

axioms (1)

domain assumption · GRPO's stability mechanisms transfer effectively to visual generation tasks and large prompt sets
This is stated as the key insight but receives no justification or proof in the abstract.

invented entities (1)

DanceGRPO framework · no independent evidence
purpose: Adaptation of GRPO for stable visual generation optimization
Newly named method whose implementation details are not visible in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
DanceGRPO... adaptation of Group Relative Policy Optimization (GRPO) for visual generation tasks... GRPO’s inherent stability mechanisms uniquely position it to overcome the optimization challenges
Foundation.LawOfExistence · defect_zero_iff_one · unclear
formulate the denoising process... as a Markov Decision Process... GRPO-style objective... advantage function $A_i = \frac{r_i - \mathrm{mean}(\{r_1, \ldots\})}{\mathrm{std}(\{r_1, \ldots\})}$
Foundation.PhiForcing · phi_equation · unclear
outperforms baseline methods by up to 181% across... HPS-v2.1, CLIP Score, VideoAlign, and GenEval
Foundation.DimensionForcing · dimension_forced · unclear
unified application framework... across diverse generative paradigms, tasks, foundational models, and reward models

Forward citations

Cited by 49 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
cs.CV 2026-04 unverdicted novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
cs.CV 2026-05 unverdicted novelty 7.0

CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...
Efficient Adjoint Matching for Fine-tuning Diffusion Models
cs.LG 2026-05 unverdicted novelty 7.0

EAM speeds up adjoint matching for diffusion model reward fine-tuning by switching to linear base drift, allowing deterministic few-step solvers and closed-form adjoints with up to 4x faster convergence on text-to-ima...
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
cs.CV 2026-05 unverdicted novelty 7.0

PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
RewardHarness: Self-Evolving Agentic Post-Training
cs.AI 2026-05 unverdicted novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
cs.LG 2026-05 unverdicted novelty 7.0

TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
cs.LG 2026-05 unverdicted novelty 7.0

TMPO replaces scalar reward maximization with trajectory-level matching to a Boltzmann distribution via Softmax-TB, improving generative diversity by 9.1% while keeping competitive reward performance.
Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline
cs.AI 2026-05 unverdicted novelty 7.0

A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.
Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models
cs.CV 2026-05 unverdicted novelty 7.0

ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.
MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery
cs.CV 2026-05 unverdicted novelty 7.0

MotionGRPO models diffusion sampling as a Markov decision process optimized with Group Relative Policy Optimization, using hybrid rewards and noise injection to boost sample diversity and local joint precision in egoc...
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
cs.CV 2026-04 unverdicted novelty 7.0

HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
cs.CV 2026-04 unverdicted novelty 7.0

OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
Generative Texture Filtering
cs.CV 2026-04 unverdicted novelty 7.0

A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
cs.CV 2026-04 unverdicted novelty 7.0

UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
cs.CV 2026-04 unverdicted novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
cs.LG 2025-09 unverdicted novelty 7.0

DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
cs.AI 2025-07 unverdicted novelty 7.0

MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
cs.CV 2026-05 unverdicted novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data
cs.CV 2026-05 unverdicted novelty 6.0

The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
cs.CV 2026-05 unverdicted novelty 6.0

DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
Threshold-Guided Optimization for Visual Generative Models
cs.LG 2026-05 unverdicted novelty 6.0

A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
AesRM: Improving Video Aesthetics with Expert-Level Feedback
cs.CV 2026-04 unverdicted novelty 6.0

AesRM introduces an expert-annotated benchmark and multi-stage trained reward models that outperform baselines in predicting video aesthetic preferences and improve alignment of video generators like Wan2.2.
Leveraging Verifier-Based Reinforcement Learning in Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.
ViPO: Visual Preference Optimization at Scale
cs.CV 2026-04 unverdicted novelty 6.0

Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.
POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation
cs.CV 2026-04 unverdicted novelty 6.0

POCA combines Pareto optimization with curriculum alignment to improve multi-reward reinforcement learning for visual text generation without relying on weighted sums.
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
cs.LG 2026-04 unverdicted novelty 6.0

V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
cs.LG 2026-04 unverdicted novelty 6.0

Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target.
MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
cs.CV 2026-04 unverdicted novelty 6.0

MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
cs.LG 2026-04 unverdicted novelty 6.0

Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
cs.LG 2026-03 unverdicted novelty 6.0

CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
cs.CL 2025-06 conditional novelty 6.0

High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
cs.CV 2026-05 unverdicted novelty 5.0

Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
cs.CV 2026-05 unverdicted novelty 5.0

DeScore decouples explicit CoT reasoning from reward regression in video reward models via a two-stage cold-start plus dual-objective RL training pipeline.
MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery
cs.CV 2026-05 unverdicted novelty 5.0

MotionGRPO applies GRPO with noise injection and hybrid rewards to diffusion-based egocentric motion recovery, overcoming vanishing gradients from low intra-group diversity to reach state-of-the-art performance.
Towards General Preference Alignment: Diffusion Models at Nash Equilibrium
cs.LG 2026-05 unverdicted novelty 5.0

Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
cs.CV 2026-04 unverdicted novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
A Systematic Post-Train Framework for Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
Reward-Aware Trajectory Shaping for Few-step Visual Generation
cs.CV 2026-04 unverdicted novelty 5.0

RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.
RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
cs.CV 2026-05 unverdicted novelty 4.0

RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
Seedance 1.0: Exploring the Boundaries of Video Generation Models
cs.CV 2025-06 unverdicted novelty 4.0

Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.
Seedance 2.0: Advancing Video Generation for World Complexity
cs.CV 2026-04 unverdicted novelty 3.0

Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for up to 3 video, 9 image, and 3 audio references.
Seedream 4.0: Toward Next-generation Multimodal Image Generation
cs.CV 2025-09 unverdicted novelty 3.0

Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 46 Pith papers · 19 internal anchors

[1]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

[2]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

[3]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

[4]

Raphael: Text-to-image generation via large mixture of diffusion paths

Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to-image generation via large mixture of diffusion paths. Advances in Neural Information Processing Systems, 36:41693–41706, 2023

[5]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

[6]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

[7]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

[8]

Seedream 2.0: A native chinese-english bilingual image generation foundation model

Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model. arXiv preprint arXiv:2503.07703, 2025

[9]

Imagereward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

[10]

Unifl: Improve latent diffusion model via unified feedback learning

Jiacheng Zhang, Jie Wu, Yuxi Ren, Xin Xia, Huafeng Kuang, Pan Xie, Jiashi Li, Xuefeng Xiao, Weilin Huang, Shilei Wen, et al. Unifl: Improve latent diffusion model via unified feedback learning. arXiv preprint arXiv:2404.05595, 2024

[11]

Controlnet++: Improving conditional controls with efficient consistency feedback

Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen. Controlnet++: Improving conditional controls with efficient consistency feedback. Project page: liming-ai.github.io/controlnet_plus_plus. In European Conference on Computer Vision, pages 129–147. Springer, 2024

[12]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

[13]

Can we generate images with cot? let’s verify and reinforce image generation step by step

Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, and Pheng-Ann Heng. Can we generate images with cot? let’s verify and reinforce image generation step by step. arXiv preprint arXiv:2501.13926, 2025

[14]

Improving Video Generation with Human Feedback

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025

[15]

Onlinevpo: Align video diffusion model with online video-centric preference optimization

Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, and Kai Han. Onlinevpo: Align video diffusion model with online video-centric preference optimization. arXiv preprint arXiv:2412.15159, 2024

[16]

Reinforcement learning: An introduction

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998

[17]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[18]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023

[19]

Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36:79858–79885, 2023

[20]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

[21]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[22]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

[23]

Flux

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

[24]

Skyreels v1: Human-centric video foundation model

SkyReels-AI. Skyreels v1: Human-centric video foundation model. https://github.com/SkyworkAI/SkyReels-V1, 2025

[25]

Human preference score: Better aligning text-to-image models with human preference

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better aligning text-to-image models with human preference. arXiv preprint arXiv:2303.14420, 1(3), 2023

[26]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

[27]

Geneval: An object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

[28]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

[29]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

[30]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019

[31]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

[32]

Building Normalizing Flows with Stochastic Interpolants

Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022

[33]

Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions, 2023. URL https://arxiv.org/abs/2303.08797, 3

[34]

Diffusion meets flow matching: Two sides of the same coin

Ruiqi Gao, Emiel Hoogeboom, Jonathan Heek, Valentin De Bortoli, Kevin P. Murphy, and Tim Salimans. Diffusion meets flow matching: Two sides of the same coin. 2024

[35]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

[36]

Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models

Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. arXiv preprint arXiv:2403.06098, 2024

[37]

Identity-preserving text-to-video generation by frequency decomposition

Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-video generation by frequency decomposition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12978–12988, 2025

[38]

Efficient-vdit: Efficient video diffusion transformers with attention tile

Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, and Hao Zhang. Efficient-vdit: Efficient video diffusion transformers with attention tile, 2025

[39]

Fast video generation with sliding tile attention

Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, and Hao Zhang. Fast video generation with sliding tile attention, 2025

[40]

Pick-a-pic: An open dataset of user preferences for text-to-image generation

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023

[41]

Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. arXiv preprint arXiv:2406.15252, 2024

[42]

Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation

Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059, 2024

[43]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[44]

A survey on evaluation of large language models

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024

[45]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

[46]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[47]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

[48]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

[49]

Rlaif: Scaling reinforcement learning from human feedback with ai feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. 2023

[50]

Scaling laws for reward model overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023

[51]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

[52]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

[53]

Step-aware preference optimization: Aligning preference with denoising performance at each step

Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, and Liang Zheng. Step-aware preference optimization: Aligning preference with denoising performance at each step. arXiv preprint arXiv:2406.04314, 2(5):7, 2024

[54]

Diffusion model as a noise-aware latent reward model for step-level preference optimization

Tao Zhang, Cheng Da, Kun Ding, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, and Chunhong Pan. Diffusion model as a noise-aware latent reward model for step-level preference optimization. arXiv preprint arXiv:2502.01051, 2025

[55]

Video diffusion alignment via reward gradients

Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Katerina Fragkiadaki, and Deepak Pathak. Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737, 2024

[56]

Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Liang Zhao, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975, 2024

[57]

Generative multimodal models are in-context learners

Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024
