Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
Pith reviewed 2026-05-11 08:23 UTC · model grok-4.3
The pith
Fine-tuning CLIP on a large bias-reduced dataset of human image choices creates a scorer that aligns better with human judgments on text-to-image outputs than prior metrics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By fine-tuning CLIP on HPD v2, which comprises 798,090 human preference choices on 433,760 pairs of images from diverse sources, we obtain HPS v2 that more accurately predicts human preferences on generated images, generalizes better across various image distributions, and is responsive to algorithmic improvements of text-to-image generative models.
What carries the argument
HPS v2, the scoring model obtained by fine-tuning CLIP on the HPD v2 human preference dataset, used to rank and compare outputs from text-to-image generative models.
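As a rough illustration of how a CLIP-style scorer can rank candidate images, the sketch below assumes the score is the cosine similarity between a prompt embedding and an image embedding. That scoring rule is an assumption here, not a detail confirmed on this page. The embeddings are made-up toy vectors, and the names `cosine_similarity` and `rank_images` are hypothetical. A real pipeline would take its embeddings from the fine-tuned text and image encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(prompt_embedding, image_embeddings):
    """Return (name, score) pairs sorted from highest to lowest score."""
    scored = [
        (name, cosine_similarity(prompt_embedding, emb))
        for name, emb in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed for illustration): one prompt, one image per model.
prompt = [1.0, 0.0, 1.0]
images = {
    "model_a": [0.9, 0.1, 0.8],  # points roughly the same way as the prompt
    "model_b": [0.1, 1.0, 0.2],  # points mostly in a different direction
}
ranking = rank_images(prompt, images)
```

Under these toy vectors `model_a` ranks first, because its embedding is nearly parallel to the prompt embedding.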
If this is right
- Allows more reliable comparison of recent text-to-image models from academic, community, and industry sources via a shared benchmark.
- Detects when algorithmic changes improve outputs in ways that match human taste rather than proxy scores.
- Supports stable, fair, and easy-to-use evaluation by guiding the design of text prompts used during scoring.
- Provides a dataset and model that can serve as a drop-in replacement for weaker automatic metrics in research pipelines.
Where Pith is reading between the lines
- Researchers could close the loop by using HPS v2 as a training signal inside generative models instead of only for post-hoc evaluation.
- The same preference-collection approach might transfer to related tasks such as text-to-video or image editing where human alignment is also hard to measure.
- Widespread adoption could shift model development away from optimizing for FID or CLIP score toward outputs that survive direct human comparison.
- Periodic retraining of the scorer on new preference data would be needed to keep pace with rapid changes in generative model capabilities.
Load-bearing premise
The collected human preferences are unbiased and representative enough that fine-tuning CLIP on them produces a scorer that continues to align with human judgments on future unseen models and image distributions.
What would settle it
Gather fresh human preference judgments on images from a new text-to-image model released after HPD v2 collection, then measure whether HPS v2 correlates more strongly with those judgments than earlier metrics such as CLIP score or FID.
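One simple way to run such a check is pairwise agreement: for each image pair, ask whether the metric prefers the same image the human chose. The sketch below is a minimal version using invented scores and labels. The function name `pairwise_agreement` and all numbers are illustrative assumptions, not results from the paper.

```python
def pairwise_agreement(scores_a, scores_b, human_prefers_a):
    """Fraction of pairs where the metric picks the image the human picked.

    A tie in metric scores is counted as preferring image B.
    """
    agree = 0
    for score_a, score_b, prefers_a in zip(scores_a, scores_b, human_prefers_a):
        metric_prefers_a = score_a > score_b
        if metric_prefers_a == prefers_a:
            agree += 1
    return agree / len(human_prefers_a)


# Hypothetical human choices on four image pairs (True means image A won).
human = [True, False, True, True]

# Hypothetical scores from a candidate metric and a baseline metric.
candidate = pairwise_agreement(
    [0.31, 0.22, 0.27, 0.30], [0.25, 0.28, 0.21, 0.33], human
)
baseline = pairwise_agreement(
    [0.20, 0.30, 0.10, 0.40], [0.25, 0.20, 0.30, 0.35], human
)
```

On this toy data the candidate agrees with humans on three of four pairs and the baseline on one of four. A real comparison would use enough pairs for the gap to be statistically meaningful.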
Original abstract
Recent text-to-image generative models can generate high-fidelity images from text inputs, but the quality of these generated images cannot be accurately evaluated by existing evaluation metrics. To address this issue, we introduce Human Preference Dataset v2 (HPD v2), a large-scale dataset that captures human preferences on images from a wide range of sources. HPD v2 comprises 798,090 human preference choices on 433,760 pairs of images, making it the largest dataset of its kind. The text prompts and images are deliberately collected to eliminate potential bias, which is a common issue in previous datasets. By fine-tuning CLIP on HPD v2, we obtain Human Preference Score v2 (HPS v2), a scoring model that can more accurately predict human preferences on generated images. Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models, making it a preferable evaluation metric for these models. We also investigate the design of the evaluation prompts for text-to-image generative models, to make the evaluation stable, fair and easy-to-use. Finally, we establish a benchmark for text-to-image generative models using HPS v2, which includes a set of recent text-to-image models from the academic, community and industry. The code and dataset is available at https://github.com/tgxs002/HPSv2 .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Human Preference Dataset v2 (HPD v2), comprising 798,090 human preference choices over 433,760 image pairs drawn from diverse text-to-image sources, with deliberate collection to reduce bias. Fine-tuning CLIP on HPD v2 yields Human Preference Score v2 (HPS v2), which the authors claim generalizes better than prior metrics (e.g., CLIP, Aesthetic Score) across image distributions and responds to algorithmic improvements in generative models. The work also examines prompt design for stable evaluation and releases a benchmark ranking recent T2I models from academia, community, and industry.
Significance. If the generalization and responsiveness claims hold under rigorous validation, HPS v2 would supply a human-aligned, practical metric that improves upon distribution-based scores like FID or uncalibrated CLIP similarity for T2I evaluation. The scale of HPD v2 and the public benchmark constitute a concrete resource for the field, provided the scorer's alignment persists on future model families.
major comments (3)
- [§4] §4 (Experiments on generalization): The central claim that HPS v2 'generalizes better than previous metrics across various image distributions' is supported only by comparisons on image sets drawn from the same pool of source models used to build HPD v2. No temporal or architectural hold-out is reported in which entire model families (e.g., post-2023 diffusion variants or novel architectures) are excluded from training data yet included in test distributions, leaving the responsiveness-to-improvements result vulnerable to distribution shift.
- [§3.2] §3.2 (HPS v2 training) and Table 2: The fine-tuning procedure is described at a high level, but the manuscript provides neither the exact loss formulation, learning-rate schedule, nor ablation on the number of negative pairs per prompt. Without these details it is impossible to assess whether the reported gains over baseline CLIP are due to the preference data itself or to hyper-parameter choices.
- [§5] §5 (Benchmark): The ranking of models is presented without error bars, inter-rater agreement statistics on the human labels, or a sensitivity analysis to prompt wording. This weakens the assertion that HPS v2 yields a 'stable, fair and easy-to-use' evaluation protocol.
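Error bars of the kind requested for the §5 ranking are commonly produced with a percentile bootstrap over evaluation prompts. The sketch below shows that generic technique. It is not the authors' procedure, and the per-prompt scores, the function name `bootstrap_ci`, and its default parameters are assumptions made for illustration.

```python
import random


def bootstrap_ci(per_prompt_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean score over prompts."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(per_prompt_scores)
    means = []
    for _ in range(n_resamples):
        # Resample prompts with replacement, keeping the sample size fixed.
        sample = [per_prompt_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-prompt scores for one model.
scores = [0.27, 0.31, 0.25, 0.29, 0.33, 0.28, 0.30, 0.26]
mean_score = sum(scores) / len(scores)
lower, upper = bootstrap_ci(scores)
```

Two models whose intervals overlap heavily cannot be confidently ordered by mean score alone, which is the information a bare ranking hides.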
minor comments (3)
- [Abstract / §2.1] The abstract states that HPD v2 'eliminates potential bias' but does not quantify residual prompt or demographic biases; a short paragraph in §2.1 citing the exact collection protocol would clarify this.
- [Figure 3] Figure 3 (qualitative examples) lacks axis labels and a legend indicating which images correspond to which model; this reduces readability.
- [§3.2] The GitHub link is given, but the manuscript does not specify the exact train/validation split sizes or the random seed used for fine-tuning, hindering reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the paper without altering its core claims.
-
Referee: [§4] §4 (Experiments on generalization): The central claim that HPS v2 'generalizes better than previous metrics across various image distributions' is supported only by comparisons on image sets drawn from the same pool of source models used to build HPD v2. No temporal or architectural hold-out is reported in which entire model families (e.g., post-2023 diffusion variants or novel architectures) are excluded from training data yet included in test distributions, leaving the responsiveness-to-improvements result vulnerable to distribution shift.
Authors: We appreciate this point on rigorous generalization testing. Our Section 4 evaluations do include image sets from diverse sources such as community fine-tunes and industry models (e.g., Midjourney v5, DALL·E variants) whose outputs were not part of HPD v2 training collection, and HPS v2 shows improved correlation with human preferences on these. However, we agree that explicit architectural and temporal hold-outs would further substantiate the claims. In the revised manuscript, we will add new experiments that exclude specific post-2023 model families from HPS v2 training data and evaluate responsiveness on held-out newer architectures, to be included in an expanded Section 4.
revision: yes
-
Referee: [§3.2] §3.2 (HPS v2 training) and Table 2: The fine-tuning procedure is described at a high level, but the manuscript provides neither the exact loss formulation, learning-rate schedule, nor ablation on the number of negative pairs per prompt. Without these details it is impossible to assess whether the reported gains over baseline CLIP are due to the preference data itself or to hyper-parameter choices.
Authors: We agree that the training details in Section 3.2 are insufficient for full reproducibility and attribution of gains. The current description was kept high-level to focus on the dataset contribution, but this was an oversight. In the revised manuscript, we will expand Section 3.2 and update Table 2 to specify the exact loss (a contrastive pairwise ranking loss on preference pairs), the learning-rate schedule (AdamW with cosine decay, initial LR of 1e-5), and include an ablation on the number of negative pairs per prompt. These additions will demonstrate that performance improvements are driven by HPD v2 rather than hyper-parameters alone.
revision: yes
-
Referee: [§5] §5 (Benchmark): The ranking of models is presented without error bars, inter-rater agreement statistics on the human labels, or a sensitivity analysis to prompt wording. This weakens the assertion that HPS v2 yields a 'stable, fair and easy-to-use' evaluation protocol.
Authors: Thank you for noting these omissions in the benchmark presentation. In the revised Section 5, we will add error bars to the model rankings using bootstrap resampling over evaluation prompts. We will also include a sensitivity analysis varying prompt wording (e.g., adding descriptors or rephrasing) to quantify stability of HPS v2 scores. For inter-rater agreement on the underlying human labels, our collection prioritized scale with single annotations per pair; we will explicitly discuss this as a limitation and note how the dataset size helps average out individual variance.
revision: partial
- Inter-rater agreement statistics cannot be computed because the HPD v2 collection process used single annotations per image pair to achieve the reported scale of 798k choices.
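The rebuttal names a pairwise ranking loss on preference pairs without writing it out. One common formulation, shown here purely as an assumption, treats the two scores as logits of a two-way softmax and penalizes the negative log-probability of the human-preferred image. The function name `pairwise_preference_loss` and the `temperature` parameter are hypothetical and are not taken from the manuscript.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected, temperature=1.0):
    """Negative log-probability that the preferred image wins a two-way softmax.

    Equivalent to -log(sigmoid(margin)), computed in a numerically stable form.
    """
    margin = (score_preferred - score_rejected) / temperature
    # softplus(-margin) = max(-margin, 0) + log(1 + exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy scores: the loss shrinks as the preferred image's score pulls ahead.
loss_correct = pairwise_preference_loss(1.0, 0.0)  # scorer agrees with the human
loss_tied = pairwise_preference_loss(0.3, 0.3)     # scorer is indifferent
loss_wrong = pairwise_preference_loss(0.0, 1.0)    # scorer disagrees
```

With tied scores the loss equals log 2, the cost of a coin flip. It falls toward zero as the scorer ranks the preferred image higher and grows roughly linearly when the scorer ranks it lower.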
Circularity Check
No significant circularity; empirical training and held-out testing are independent
The paper explicitly collects HPD v2 human preference data, fine-tunes CLIP to produce HPS v2, and then reports generalization results on various image distributions. This is standard supervised learning with no self-definitional loop, no fitted parameter renamed as a prediction, and no load-bearing self-citation that reduces the central claim to its own inputs. The generalization experiments are presented as tests on independent distributions rather than tautological outputs of the training process.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human preferences over image pairs can be effectively captured and generalized by fine-tuning a pre-trained vision-language model such as CLIP on a large collected dataset.
Lean theorems connected to this paper
-
Foundation.LawOfExistence · defect_zero_iff_one · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
-
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
-
Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models
Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
-
CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation
CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.
-
TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design
TASTE supplies designer ratings across nine criteria for outputs from four text-to-image models, with statistical tests showing moderate agreement and benchmarks where existing scorers reach at most 0.55 macro agreeme...
-
Probability-Conserving Flow Guidance
AdaMaG is a guidance rule for generative models derived from decomposing continuity-equation effects into divergence and score-parallel terms, with a proof that divergence diverges near the manifold and a time-depende...
-
AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment
AutoRubric-T2I learns a small set of interpretable rubrics for VLM judges that outperform scalar reward models on T2I benchmarks while using far less preference data.
-
AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment
AutoRubric-T2I learns and selects explicit rubrics from preference pairs to guide VLM judges, producing high-quality interpretable rewards for T2I alignment with far less data than traditional Bradley-Terry models.
-
SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability
SeamCam quantifies camouflage by computing one minus the highest IoU recoverable from category-conditioned detection proposals against a ground-truth mask, achieving 78.82% agreement with human judgments.
-
Pareto-Guided Optimal Transport for Multi-Reward Alignment
PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment while introducing JDR and JCR metrics to measure synergy and hacking.
-
Asymmetric Flow Models
Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
-
STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models
STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.
-
Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
Reinforce Adjoint Matching derives a simple consistency loss for RL post-training of diffusion models by tilting the clean distribution toward higher-reward samples under KL regularization while keeping the noising pr...
-
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
-
Attention Sinks in Diffusion Transformers: A Causal Analysis
Suppressing attention sinks in diffusion transformers does not degrade text-image alignment or most preference metrics, revealing a dissociation between generation trajectory changes and semantic output quality.
-
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.
-
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
TMPO replaces scalar reward maximization with trajectory-level matching to a Boltzmann distribution via Softmax-TB, improving generative diversity by 9.1% while keeping competitive reward performance.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...
-
LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling
LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.
-
Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models
ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...
-
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
-
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
-
Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
-
Comparison Drives Preference: Reference-Aware Modeling for AI-Generated Video Quality Assessment
RefVQA uses a query-centered reference graph and graph-guided difference aggregation to improve AI-generated video quality assessment by incorporating inter-video comparisons.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
OneHOI: Unifying Human-Object Interaction Generation and Editing
OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.
-
SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
SOAR is a reward-free on-policy method that supplies dense per-timestep supervision to correct exposure bias in diffusion model denoising trajectories, raising GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over ...
-
RewardFlow: Generate Images by Optimizing What You Reward
RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.
-
Personalizing Text-to-Image Generation to Individual Taste
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
-
Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling
HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.
-
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
-
SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis
SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.
-
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
-
Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching
Stroke of Surprise is a framework that generates vector sketches undergoing semantic transformation from one concept to another by adding strokes, using dual-branch SDS and overlay loss for optimization.
-
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
-
Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution
Tiled Prompts generates tile-specific text prompts for each latent tile in diffusion super-resolution to reduce errors from global prompts and improve perceptual quality.
-
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
LocalDPO creates localized preference pairs from real videos by applying random spatio-temporal masks and restoring masked regions with the frozen base model, then applies region-restricted DPO loss to improve fidelit...
-
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.
-
It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
Noise optimization during sampling recovers diversity in mode-collapsed diffusion models while preserving output fidelity.
-
Determinism of Randomness: Prompt-Residual Seed Shaping for Diffusion Generation
A geometric view of semantic anisotropy in diffusion latents motivates a prompt-residual seed-shaping method that improves prompt alignment and visual quality without training.
-
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
-
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts
T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.
-
SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers
SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.
-
Hierarchical Variational Policies for Reward-Guided Diffusion
A hierarchical variational formulation amortizes test-time guidance in diffusion models to achieve strong quality-speed tradeoffs with significantly reduced inference compute.
-
Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics
A feature supervision approach using SigLIP 2 extracts multi-granularity vision-aligned text representations to supervise MM-DiT image branches, pushing the Pareto frontier for portrait generation across alignment, re...
-
Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection
ABSS ranks diffusion seeds by early cross-attention strength to prompt core tokens and retains only the top-k for full generation, yielding consistent gains in alignment and quality on Stable Diffusion variants.
-
Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
AdaScope adaptively selects optimal RL intervention points during diffusion denoising by monitoring structural and semantic changes, delivering 66% higher performance at 59% lower cost than full-trajectory RL baselines.
-
ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices
ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency p...
-
HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling
HeatKV ranks attention heads by their focus on prior scales using offline calibration data and applies a static per-head pruning schedule, delivering 2x higher KV-cache compression than prior methods on the Infinity-2...
-
Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation
Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
-
EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation
EPIC introduces predicate-guided inference-time search that lifts compositional T2I prompt accuracy from 34% to 71% on GenEval2 with 31-81% lower execution costs.
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
-
Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
Derives RAM, a reward-adjusted consistency loss extending diffusion pretraining regression to efficient KL-regularized RL post-training, achieving peak rewards up to 50x faster than Flow-GRPO on Stable Diffusion 3.5M.
Reference graph
Works this paper leans on
-
[1]
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015
work page 2015
-
[2]
Cogview2: Faster and better text-to-image generation via hierarchical transformers
Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. NeurIPS, 35:16890–16902, 2022
work page 2022
-
[3]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, pages 12873–12883, 2021
work page 2021
-
[4]
Vector quantized diffusion model for text-to-image synthesis
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, pages 10696–10706, 2022
work page 2022
- [5]
-
[6]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022
work page 2022
-
[7]
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS, 2017
work page 2017
-
[8]
Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021. If you use this software, please cite it as below
work page 2021
-
[9]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[10]
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation. arXiv preprint arXiv:2305.01569, 2023
-
[11]
Aligning Text-to-Image Models using Human Feedback
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023
-
[12]
AGIQA-3K: An Open Database for AI-Generated Image Quality Assessment, 2023
Chunyi Li, Zicheng Zhang, Haoning Wu, Wei Sun, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, and Weisi Lin. AGIQA-3K: An Open Database for AI-Generated Image Quality Assessment, 2023
-
[13]
FuseDream: Training-free text-to-image generation with improved CLIP+GAN space optimization
Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, and Qiang Liu. FuseDream: Training-free text-to-image generation with improved CLIP+GAN space optimization. arXiv preprint arXiv:2112.01573, 2021
-
[14]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017
-
[15]
Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images, 2023
Zeyu Lu, Di Huang, Lei Bai, Jingjing Qu, Chengyue Wu, Xihui Liu, and Wanli Ouyang. Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images, 2023
-
[16]
AVA: A large-scale database for aesthetic visual analysis
Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. In CVPR, pages 2408–2415, 2012
-
[17]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In ICML, 2021
-
[18]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NeurIPS, 35:27730–27744, 2022
-
[19]
John David Pressman, Katherine Crowson, and Simulacra Captions Contributors. Simulacra Aesthetic Captions. Technical Report Version 1.0, Stability AI, 2022
-
[20]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021
-
[21]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. ArXiv, abs/2204.06125, 2022
-
[22]
Zero-Shot Text-to-Image Generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. ArXiv, abs/2102.12092, 2021
-
[23]
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR, pages 10674–10685, 2022
-
[24]
Photorealistic text-to-image diffusion models with deep language understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 35:36479–36494, 2022
-
[25]
Improved techniques for training gans
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016
-
[26]
Generating images of rare concepts using pre-trained diffusion models, 2023
Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, and Gal Chechik. It is all about where you start: Text-to-image generation with seed selection. arXiv preprint arXiv:2304.14530, 2023
-
[27]
StyleGAN-T: Unlocking the power of gans for fast large-scale text-to-image synthesis
Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. StyleGAN-T: Unlocking the power of gans for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515, 2023
-
[28]
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022
-
[29]
LAION-5b: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text ...
-
[30]
Proper reuse of image classification features improves object detection
Cristina Vasconcelos, Vighnesh Birodkar, and Vincent Dumoulin. Proper reuse of image classification features improves object detection. In CVPR, pages 13628–13637, 2022
-
[31]
DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models
Zijie J Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models. arXiv preprint arXiv:2210.14896, 2022
-
[32]
Better Aligning Text-to-Image Models with Human Preference, 2023
Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better Aligning Text-to-Image Models with Human Preference, 2023
-
[33]
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, 2023
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, 2023
-
[34]
Versatile diffusion: Text, images and variations all in one diffusion model
Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile Diffusion: Text, Images and Variations All in One Diffusion Model. arXiv preprint arXiv:2211.08332, 2022
-
[35]
LiT: Zero-Shot Transfer With Locked-Image Text Tuning
Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-Shot Transfer With Locked-Image Text Tuning. In CVPR, pages 18123–18133, June 2022
-
[36]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018
-
[37]
A Perceptual Quality Assessment Exploration for AIGC Images, 2023
Zicheng Zhang, Chunyi Li, Wei Sun, Xiaohong Liu, Xiongkuo Min, and Guangtao Zhai. A Perceptual Quality Assessment Exploration for AIGC Images, 2023
-
[38]
Hype: A benchmark for human eye perceptual evaluation of generative models
Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li F Fei-Fei, and Michael Bernstein. Hype: A benchmark for human eye perceptual evaluation of generative models. NeurIPS, 32, 2019
-
[39]
LAFITE: Towards language-free training for text-to-image generation
Y Zhou, R Zhang, C Chen, C Li, C Tensmeyer, T Yu, J Gu, J Xu, and T Sun. LAFITE: Towards Language-Free Training for Text-to-Image Generation. arXiv preprint arXiv:2111.13792, 2021
Checklist
The checklist follows the references. Please read the checklist guidelines carefully for information on how to answer these questions. For each question, change...
-
[40]
For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes]
(c) Did you discuss any potential negative societal impacts of your work? [N/A]
(d) Have you read the ethics review guidelines and ensured that your paper con...
-
[41]
If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
-
[42]
If you ran experiments (e.g. for benchmarks)...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Please see the supplemental material.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes...
-
[43]
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We will release our dataset and pre-train mode...
-
[44]
If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] We will show our instructions given to the workers in the supplemental material.
(b) Did you describe any potential participant risks, with links to Institutional Review Boar...
- [45]
-
[46]
If the picture belongs to the style of “anime and cartoon”, reply only with “anime and cartoon”
-
[47]
If the picture belongs to the style of “real photo”, reply only with “real photo”
-
[48]
If the picture belongs to the style of “concept-art”, reply only with “concept-art”
-
[49]
If the picture doesn’t belong to any styles of above, reply only with “others”; You must reply with only on word.
Even though the prompts of the “Photo” category in HPD v2 come from COCO Captions [1], we retain “Photo” in the classification process to mitigate potential mistakes made by ChatGPT. The category distribution of HPD v2 is illustrated in Fig. 7. Add...
-
[50]
When Image (A) surpasses Image (B) in terms of aesthetic appeal and fidelity, or Image (B) suffers from severe distortion and blurriness, even if Image (B) aligns better with the prompt, Image (A) should take precedence over Image (B). For example, in Fig. 8, Fig. 8(b) lacks clear out...
Figure 8 (a), (b): Prompt: A pair of skis standing up against a gate.
-
[51]
When facing a dilemma in which the images are relatively similar in terms of aesthetics and personal preference, please read the prompt carefully and sort based more on the text-image alignment. For example, if you cannot make a choice based on personal preference, as in Fig. 10, please pay attention to the description, which refers to a mouse me...
-
[52]
It is crucial to pay special attention to the capitalized names, as these names may lead to misunderstandings during the machine translation process. If there is any incorrectly translated proprietary term or content you are not familiar with, we recommend that you search for sample images and explanations online.
Figure 10 (a), (b): Prompt: A ginger hair...