Pith · machine review for the scientific record

arXiv: 2306.09341 · v2 · submitted 2023-06-15 · 💻 cs.CV · cs.AI · cs.DB

Recognition: 3 theorem links · Lean theorems

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, Hongsheng Li

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 08:23 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.DB
keywords human preference dataset · text-to-image synthesis · evaluation metric · CLIP fine-tuning · preference scoring · generative model benchmark · image quality assessment · HPS v2

The pith

Fine-tuning CLIP on a large bias-reduced dataset of human image choices creates a scorer that aligns better with human judgments on text-to-image outputs than prior metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects HPD v2, a dataset of 798,090 human preference choices across 433,760 image pairs drawn from many sources and prompts chosen to reduce bias. Fine-tuning CLIP on these choices produces HPS v2, a model that scores how well generated images match what people prefer. Experiments show this score generalizes across different image distributions and changes when text-to-image models improve their outputs. A reader would care because current automatic metrics often disagree with human opinion, making it hard to know which generative advances are real. The new scorer therefore offers a more trustworthy way to measure and guide progress in image synthesis.

Core claim

By fine-tuning CLIP on HPD v2, which comprises 798,090 human preference choices on 433,760 pairs of images from diverse sources, we obtain HPS v2 that more accurately predicts human preferences on generated images, generalizes better across various image distributions, and is responsive to algorithmic improvements of text-to-image generative models.

What carries the argument

HPS v2, the scoring model obtained by fine-tuning CLIP on the HPD v2 human preference dataset, used to rank and compare outputs from text-to-image generative models.
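
A minimal sketch of this kind of preference fine-tuning, assuming an OpenCLIP ViT-H/14 backbone and a Bradley-Terry pairwise loss on temperature-scaled image-text similarities; the backbone, loss formulation, and learning rate here are illustrative assumptions rather than the paper's confirmed recipe.

```python
import torch
import torch.nn.functional as F
import open_clip

# Assumed backbone and pretrained weights; HPS v2's exact choices may differ.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def score(images: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Temperature-scaled cosine similarity between image and prompt embeddings."""
    img = F.normalize(model.encode_image(images), dim=-1)
    txt = F.normalize(model.encode_text(tokens), dim=-1)
    return model.logit_scale.exp() * (img * txt).sum(dim=-1)

def preference_step(tokens, img_chosen, img_rejected):
    """One update on a batch of human choices: the chosen image should out-score the rejected one."""
    loss = -F.logsigmoid(score(img_chosen, tokens) - score(img_rejected, tokens)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: tokens = tokenizer(["a prompt from HPD v2"]); images are run through
# `preprocess` before being batched into img_chosen / img_rejected.
```

At inference, the same score function ranks candidate generations for a prompt, which is how a scorer of this kind slots into a benchmark.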

If this is right

  • Allows more reliable comparison of recent text-to-image models from academic, community, and industry sources via a shared benchmark.
  • Detects when algorithmic changes improve outputs in ways that match human taste rather than proxy scores.
  • Supports stable, fair, and easy-to-use evaluation by guiding the design of text prompts used during scoring.
  • Provides a dataset and model that can serve as a drop-in replacement for weaker automatic metrics in research pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers could close the loop by using HPS v2 as a training signal inside generative models instead of only for post-hoc evaluation.
  • The same preference-collection approach might transfer to related tasks such as text-to-video or image editing where human alignment is also hard to measure.
  • Widespread adoption could shift model development away from optimizing for FID or CLIP score toward outputs that survive direct human comparison.
  • Periodic retraining of the scorer on new preference data would be needed to keep pace with rapid changes in generative model capabilities.

Load-bearing premise

The collected human preferences are unbiased and representative enough that fine-tuning CLIP on them produces a scorer that continues to align with human judgments on future unseen models and image distributions.

What would settle it

Gather fresh human preference judgments on images from a new text-to-image model released after HPD v2 collection, then measure whether HPS v2 correlates more strongly with those judgments than earlier metrics such as CLIP score or FID.
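
A sketch of how that comparison could be scored, assuming each fresh judgment is stored as a prompt, two images, and a human choice; the data layout and scorer handles below are hypothetical. Pairwise accuracy stands in for correlation with human judgments, while FID, being a distribution-level metric, would instead be compared through model-level rankings.

```python
from dataclasses import dataclass
from typing import Callable, List
from PIL import Image

@dataclass
class PreferencePair:
    prompt: str
    image_a: Image.Image
    image_b: Image.Image
    human_choice: int  # 0 if annotators preferred image_a, 1 if they preferred image_b

Scorer = Callable[[Image.Image, str], float]

def pairwise_accuracy(pairs: List[PreferencePair], scorer: Scorer) -> float:
    """Fraction of fresh human choices that a scoring metric reproduces."""
    hits = 0
    for p in pairs:
        predicted = 0 if scorer(p.image_a, p.prompt) >= scorer(p.image_b, p.prompt) else 1
        hits += int(predicted == p.human_choice)
    return hits / len(pairs)

# Hypothetical usage on judgments collected for a post-HPD-v2 model:
# acc_hps = pairwise_accuracy(fresh_pairs, hps_v2_score)
# acc_clip = pairwise_accuracy(fresh_pairs, clip_score)
# The claim would be supported if acc_hps consistently exceeds acc_clip on these held-out pairs.
```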

read the original abstract

Recent text-to-image generative models can generate high-fidelity images from text inputs, but the quality of these generated images cannot be accurately evaluated by existing evaluation metrics. To address this issue, we introduce Human Preference Dataset v2 (HPD v2), a large-scale dataset that captures human preferences on images from a wide range of sources. HPD v2 comprises 798,090 human preference choices on 433,760 pairs of images, making it the largest dataset of its kind. The text prompts and images are deliberately collected to eliminate potential bias, which is a common issue in previous datasets. By fine-tuning CLIP on HPD v2, we obtain Human Preference Score v2 (HPS v2), a scoring model that can more accurately predict human preferences on generated images. Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models, making it a preferable evaluation metric for these models. We also investigate the design of the evaluation prompts for text-to-image generative models, to make the evaluation stable, fair and easy-to-use. Finally, we establish a benchmark for text-to-image generative models using HPS v2, which includes a set of recent text-to-image models from the academic, community and industry. The code and dataset is available at https://github.com/tgxs002/HPSv2 .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Human Preference Dataset v2 (HPD v2), comprising 798,090 human preference choices over 433,760 image pairs drawn from diverse text-to-image sources, with deliberate collection to reduce bias. Fine-tuning CLIP on HPD v2 yields Human Preference Score v2 (HPS v2), which the authors claim generalizes better than prior metrics (e.g., CLIP, Aesthetic Score) across image distributions and responds to algorithmic improvements in generative models. The work also examines prompt design for stable evaluation and releases a benchmark ranking recent T2I models from academia, community, and industry.

Significance. If the generalization and responsiveness claims hold under rigorous validation, HPS v2 would supply a human-aligned, practical metric that improves upon distribution-based scores like FID or uncalibrated CLIP similarity for T2I evaluation. The scale of HPD v2 and the public benchmark constitute a concrete resource for the field, provided the scorer's alignment persists on future model families.

major comments (3)
  1. [§4] §4 (Experiments on generalization): The central claim that HPS v2 'generalizes better than previous metrics across various image distributions' is supported only by comparisons on image sets drawn from the same pool of source models used to build HPD v2. No temporal or architectural hold-out is reported in which entire model families (e.g., post-2023 diffusion variants or novel architectures) are excluded from training data yet included in test distributions, leaving the responsiveness-to-improvements result vulnerable to distribution shift.
  2. [§3.2] §3.2 (HPS v2 training) and Table 2: The fine-tuning procedure is described at a high level, but the manuscript provides neither the exact loss formulation, learning-rate schedule, nor ablation on the number of negative pairs per prompt. Without these details it is impossible to assess whether the reported gains over baseline CLIP are due to the preference data itself or to hyper-parameter choices.
  3. [§5] §5 (Benchmark): The ranking of models is presented without error bars, inter-rater agreement statistics on the human labels, or a sensitivity analysis to prompt wording. This weakens the assertion that HPS v2 yields a 'stable, fair and easy-to-use' evaluation protocol.
minor comments (3)
  1. [Abstract / §2.1] The abstract states that HPD v2 'eliminates potential bias' but does not quantify residual prompt or demographic biases; a short paragraph in §2.1 citing the exact collection protocol would clarify this.
  2. [Figure 3] Figure 3 (qualitative examples) lacks axis labels and a legend indicating which images correspond to which model; this reduces readability.
  3. [§3.2] The GitHub link is given, but the manuscript does not specify the exact train/validation split sizes or the random seed used for fine-tuning, hindering reproducibility.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the paper without altering its core claims.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments on generalization): The central claim that HPS v2 'generalizes better than previous metrics across various image distributions' is supported only by comparisons on image sets drawn from the same pool of source models used to build HPD v2. No temporal or architectural hold-out is reported in which entire model families (e.g., post-2023 diffusion variants or novel architectures) are excluded from training data yet included in test distributions, leaving the responsiveness-to-improvements result vulnerable to distribution shift.

    Authors: We appreciate this point on rigorous generalization testing. Our Section 4 evaluations do include image sets from diverse sources such as community fine-tunes and industry models (e.g., Midjourney v5, DALL·E variants) whose outputs were not part of the HPD v2 training collection, and HPS v2 shows improved correlation with human preferences on these. However, we agree that explicit architectural and temporal hold-outs would further substantiate the claims. In the revised manuscript, we will add new experiments that exclude specific post-2023 model families from the HPS v2 training data and evaluate responsiveness on held-out newer architectures, to be included in an expanded Section 4. revision: yes

  2. Referee: [§3.2] §3.2 (HPS v2 training) and Table 2: The fine-tuning procedure is described at a high level, but the manuscript provides neither the exact loss formulation, learning-rate schedule, nor ablation on the number of negative pairs per prompt. Without these details it is impossible to assess whether the reported gains over baseline CLIP are due to the preference data itself or to hyper-parameter choices.

    Authors: We agree that the training details in Section 3.2 are insufficient for full reproducibility and attribution of gains. The current description was kept high-level to focus on the dataset contribution, but this was an oversight. In the revised manuscript, we will expand Section 3.2 and update Table 2 to specify the exact loss (a contrastive pairwise ranking loss on preference pairs), the learning-rate schedule (AdamW with cosine decay, initial LR of 1e-5), and include an ablation on the number of negative pairs per prompt. These additions will demonstrate that performance improvements are driven by HPD v2 rather than hyper-parameters alone. revision: yes

  3. Referee: [§5] §5 (Benchmark): The ranking of models is presented without error bars, inter-rater agreement statistics on the human labels, or a sensitivity analysis to prompt wording. This weakens the assertion that HPS v2 yields a 'stable, fair and easy-to-use' evaluation protocol.

    Authors: Thank you for noting these omissions in the benchmark presentation. In the revised Section 5, we will add error bars to the model rankings using bootstrap resampling over evaluation prompts. We will also include a sensitivity analysis varying prompt wording (e.g., adding descriptors or rephrasing) to quantify stability of HPS v2 scores. For inter-rater agreement on the underlying human labels, our collection prioritized scale with single annotations per pair; we will explicitly discuss this as a limitation and note how the dataset size helps average out individual variance. revision: partial
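
A minimal sketch of the bootstrap the simulated rebuttal describes, assuming per-prompt HPS v2 scores are already computed for each benchmarked model; the revised paper's exact resampling procedure may differ.

```python
import numpy as np

def bootstrap_model_score(per_prompt_scores: np.ndarray,
                          n_resamples: int = 10_000,
                          seed: int = 0) -> tuple[float, float, float]:
    """Mean score for one model with a 95% bootstrap interval over evaluation prompts."""
    rng = np.random.default_rng(seed)
    n = len(per_prompt_scores)
    resampled_means = np.array([
        per_prompt_scores[rng.integers(0, n, size=n)].mean()
        for _ in range(n_resamples)
    ])
    low, high = np.percentile(resampled_means, [2.5, 97.5])
    return float(per_prompt_scores.mean()), float(low), float(high)

# Hypothetical usage: mean, low, high = bootstrap_model_score(scores_for_model_x)
# Overlapping intervals between two models would flag rankings the benchmark
# cannot distinguish reliably.
```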

standing simulated objections not resolved
  • Inter-rater agreement statistics cannot be computed because the HPD v2 collection process used single annotations per image pair to achieve the reported scale of 798k choices.

Circularity Check

0 steps flagged

No significant circularity; empirical training and held-out testing are independent

full rationale

The paper explicitly collects HPD v2 human preference data, fine-tunes CLIP to produce HPS v2, and then reports generalization results on various image distributions. This is standard supervised learning with no self-definitional loop, no fitted parameter renamed as a prediction, and no load-bearing self-citation that reduces the central claim to its own inputs. The generalization experiments are presented as tests on independent distributions rather than tautological outputs of the training process.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pairwise human preference data can be used to fine-tune a vision-language model into a generalizable scorer, plus standard machine-learning assumptions about generalization from training data.

axioms (1)
  • Domain assumption: Human preferences over image pairs can be effectively captured and generalized by fine-tuning a pre-trained vision-language model such as CLIP on a large collected dataset.
    Invoked when the paper states that fine-tuning CLIP on HPD v2 yields a scoring model that predicts human preferences.

pith-pipeline@v0.9.0 · 5579 in / 1359 out tokens · 70025 ms · 2026-05-11T08:23:18.684603+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.LawOfExistence defect_zero_iff_one (tag: unclear)

    Relation between the paper passage and the cited Recognition theorem:

    Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

    cs.CV 2026-04 unverdicted novelty 8.0

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  2. Pareto-Guided Optimal Transport for Multi-Reward Alignment

    cs.CV 2026-05 unverdicted novelty 7.0

    PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment while introducing JDR and JCR metrics to measure synergy and hacking.

  3. Asymmetric Flow Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...

  4. STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.

  5. Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Reinforce Adjoint Matching derives a simple consistency loss for RL post-training of diffusion models by tilting the clean distribution toward higher-reward samples under KL regularization while keeping the noising pr...

  6. ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.

  7. Attention Sinks in Diffusion Transformers: A Causal Analysis

    cs.CV 2026-05 unverdicted novelty 7.0

    Suppressing attention sinks in diffusion transformers does not degrade text-image alignment or most preference metrics, revealing a dissociation between generation trajectory changes and semantic output quality.

  8. TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    TMPO replaces scalar reward maximization with trajectory-level matching to a Boltzmann distribution via Softmax-TB, improving generative diversity by 9.1% while keeping competitive reward performance.

  9. TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.

  10. LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling

    cs.CV 2026-05 unverdicted novelty 7.0

    LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.

  11. Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.

  12. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 7.0

    FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.

  13. Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

  14. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  15. Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.

  16. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  17. Depth Adaptive Efficient Visual Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 7.0

    DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

  18. Comparison Drives Preference: Reference-Aware Modeling for AI-Generated Video Quality Assessment

    cs.CV 2026-04 unverdicted novelty 7.0

    RefVQA uses a query-centered reference graph and graph-guided difference aggregation to improve AI-generated video quality assessment by incorporating inter-video comparisons.

  19. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  20. OneHOI: Unifying Human-Object Interaction Generation and Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.

  21. SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

    cs.LG 2026-04 unverdicted novelty 7.0

    SOAR is a reward-free on-policy method that supplies dense per-timestep supervision to correct exposure bias in diffusion model denoising trajectories, raising GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over ...

  22. RewardFlow: Generate Images by Optimizing What You Reward

    cs.CV 2026-04 unverdicted novelty 7.0

    RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.

  23. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  24. Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling

    cs.LG 2026-04 unverdicted novelty 7.0

    HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.

  25. 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation

    cs.CV 2026-04 conditional novelty 7.0

    1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.

  26. SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis

    cs.CV 2026-03 conditional novelty 7.0

    SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.

  27. MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    cs.AI 2025-07 unverdicted novelty 7.0

    MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.

  28. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  29. HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    HeatKV ranks attention heads by their focus on prior scales using offline calibration data and applies a static per-head pruning schedule, delivering 2x higher KV-cache compression than prior methods on the Infinity-2...

  30. Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.

  31. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV 2026-05 unverdicted novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  32. EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    EPIC introduces predicate-guided inference-time search that lifts compositional T2I prompt accuracy from 34% to 71% on GenEval2 with 31-81% lower execution costs.

  33. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  34. LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency

    cs.CV 2026-05 unverdicted novelty 6.0

    LimeCross enables text-guided editing of individual layers in composite images by conditioning on cross-layer context via bi-stream attention while preserving layer integrity and introducing the LayerEditBench benchmark.

  35. Attention Sinks in Diffusion Transformers: A Causal Analysis

    cs.CV 2026-05 unverdicted novelty 6.0

    Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.

  36. Removing the Watermark Is Not Enough: Forensic Stealth in Generative-AI Watermark Removal

    cs.CR 2026-05 unverdicted novelty 6.0

    Current AI image watermark removal attacks replace the watermark with a different forensic signal, allowing independent detectors to distinguish processed outputs from clean images at over 98% true-positive rate under...

  37. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

    cs.AI 2026-05 unverdicted novelty 6.0

    Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...

  38. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.

  39. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.

  40. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  41. Threshold-Guided Optimization for Visual Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

  42. Advancing Aesthetic Image Generation via Composition Transfer

    cs.CV 2026-05 unverdicted novelty 6.0

    Composer enables semantic-agnostic composition transfer from references and theme-driven planning via LVLMs to improve aesthetic quality in diffusion-based image generation.

  43. Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.

  44. Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    Semi-DPO applies semi-supervised learning to noisy preference data in diffusion DPO by training first on consensus pairs then iteratively pseudo-labeling conflicts, yielding state-of-the-art alignment with complex hum...

  45. POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    POCA combines Pareto optimization with curriculum alignment to improve multi-reward reinforcement learning for visual text generation without relying on weighted sums.

  46. V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

    cs.LG 2026-04 unverdicted novelty 6.0

    V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.

  47. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

    cs.CV 2026-04 unverdicted novelty 6.0

    By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

  48. Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target.

  49. VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

    cs.CV 2026-04 unverdicted novelty 6.0

    VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.

  50. Bias at the End of the Score

    cs.CV 2026-04 unverdicted novelty 6.0

    Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.

  51. Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.

  52. Generative Phomosaic with Structure-Aligned and Personalized Diffusion

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper presents the first generative photomosaic framework that synthesizes tiles via structure-aligned diffusion models and few-shot personalization instead of color-based matching from large tile collections.

  53. MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.

  54. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  55. Improving Video Generation with Human Feedback

    cs.CV 2025-01 unverdicted novelty 6.0

    A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

  56. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  57. Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

    cs.LG 2026-05 unverdicted novelty 5.0

    Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.

  58. A Systematic Post-Train Framework for Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

  59. DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

    cs.AI 2026-04 unverdicted novelty 5.0

    DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLH...

  60. Reward-Aware Trajectory Shaping for Few-step Visual Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 60 Pith papers · 6 internal anchors

  1. [1]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015

  2. [2]

    Cogview2: Faster and better text-to-image generation via hierarchical transformers

    Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. NeurIPS, 35:16890–16902, 2022

  3. [3]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, pages 12873–12883, 2021

  4. [4]

    Vector quantized diffusion model for text-to-image synthesis

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, pages 10696–10706, 2022

  5. [5]

    Detoxify

    Laura Hanu and Unitary team. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020

  6. [6]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022

  7. [7]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS, 2017

  8. [8]

    OpenCLIP, July 2021

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021

  9. [9]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  10. [10]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation. arXiv preprint arXiv:2305.01569, 2023

  11. [11]

    Aligning Text-to-Image Models using Human Feedback

    Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023

  12. [12]

    AGIQA-3K: An Open Database for AI-Generated Image Quality Assessment, 2023

    Chunyi Li, Zicheng Zhang, Haoning Wu, Wei Sun, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, and Weisi Lin. AGIQA-3K: An Open Database for AI-Generated Image Quality Assessment, 2023

  13. [13]

    Fusedream: Training-free text-to-image generation with improved clip+ gan space optimization

    Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, and Qiang Liu. Fusedream: Training-free text-to-image generation with improved clip+ gan space optimization. arXiv preprint arXiv:2112.01573, 2021

  14. [14]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  15. [15]

    Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images, 2023

    Zeyu Lu, Di Huang, Lei Bai, Jingjing Qu, Chengyue Wu, Xihui Liu, and Wanli Ouyang. Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images, 2023

  16. [16]

    AVA: A large-scale database for aesthetic visual analysis

    Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. CVPR, pages 2408–2415, 2012

  17. [17]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In ICML, 2021

  18. [18]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NeurIPS, 35:27730–27744, 2022

  19. [19]

    Simulacra Aesthetic Captions

    John David Pressman, Katherine Crowson, and Simulacra Captions Contributors. Simulacra Aesthetic Captions. Technical Report Version 1.0, Stability AI, 2022

  20. [20]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021

  21. [21]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. ArXiv, abs/2204.06125, 2022

  22. [22]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. ArXiv, abs/2102.12092, 2021

  23. [23]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR, pages 10674–10685, 2022

  24. [24]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 35:36479–36494, 2022

  25. [25]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in neural information processing systems, 29, 2016

  26. [26]

    Generating images of rare concepts using pre-trained diffusion models, 2023

    Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, and Gal Chechik. It is all about where you start: Text-to-image generation with seed selection. arXiv preprint arXiv:2304.14530, 2023

  27. [27]

    StyleGAN-T: Unlocking the power of GANs for fast large-scale text-to-image synthesis

    Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. StyleGAN-T: Unlocking the power of gans for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515, 2023

  28. [28]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022

  29. [29]

    LAION-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R 10 Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text ...

  30. [30]

    Proper reuse of image classification features improves object detection

    Cristina Vasconcelos, Vighnesh Birodkar, and Vincent Dumoulin. Proper reuse of image classification features improves object detection. In CVPR, pages 13628–13637, 2022

  31. [31]

    DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models

    Zijie J Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models. arXiv preprint arXiv:2210.14896, 2022

  32. [32]

    Better Aligning Text-to-Image Models with Human Preference, 2023

    Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better Aligning Text-to-Image Models with Human Preference, 2023

  33. [33]

    ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, 2023

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, 2023

  34. [34]

    Versatile Diffusion: Text, Images and Variations All in One Diffusion Model

    Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile Diffusion: Text, Images and Variations All in One Diffusion Model. arXiv preprint arXiv:2211.08332, 2022

  35. [35]

    LiT: Zero-Shot Transfer With Locked-Image Text Tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-Shot Transfer With Locked-Image Text Tuning. In CVPR, pages 18123–18133, June 2022

  36. [36]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018

  37. [37]

    A Perceptual Quality Assessment Exploration for AIGC Images, 2023

    Zicheng Zhang, Chunyi Li, Wei Sun, Xiaohong Liu, Xiongkuo Min, and Guangtao Zhai. A Perceptual Quality Assessment Exploration for AIGC Images, 2023

  38. [38]

    Hype: A benchmark for human eye perceptual evaluation of generative models

    Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li F Fei-Fei, and Michael Bernstein. Hype: A benchmark for human eye perceptual evaluation of generative models. NeurIPS, 32, 2019

  39. [39]

    Lafite: Towards language-free training for text-to-image generation

    Y Zhou, R Zhang, C Chen, C Li, C Tensmeyer, T Yu, J Gu, J Xu, and T Sun. LAFITE: Towards Language-Free Training for Text-to-Image Generation. arXiv preprint arXiv:2111.13792, 2021
