
arxiv: 2306.09341 · v2 · submitted 2023-06-15 · 💻 cs.CV · cs.AI · cs.DB

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Pith reviewed 2026-05-11 08:23 UTC · model grok-4.3

keywords human preference dataset · text-to-image synthesis · evaluation metric · CLIP fine-tuning · preference scoring · generative model benchmark · image quality assessment · HPS v2

The pith

Fine-tuning CLIP on a large bias-reduced dataset of human image choices creates a scorer that aligns better with human judgments on text-to-image outputs than prior metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects HPD v2, a dataset of 798,090 human preference choices across 433,760 image pairs drawn from many sources and prompts chosen to reduce bias. Fine-tuning CLIP on these choices produces HPS v2, a model that scores how well generated images match what people prefer. Experiments show this score generalizes across different image distributions and changes when text-to-image models improve their outputs. A reader would care because current automatic metrics often disagree with human opinion, making it hard to know which generative advances are real. The new scorer therefore offers a more trustworthy way to measure and guide progress in image synthesis.

Core claim

By fine-tuning CLIP on HPD v2, which comprises 798,090 human preference choices on 433,760 pairs of images from diverse sources, we obtain HPS v2 that more accurately predicts human preferences on generated images, generalizes better across various image distributions, and is responsive to algorithmic improvements of text-to-image generative models.

What carries the argument

HPS v2, the scoring model obtained by fine-tuning CLIP on the HPD v2 human preference dataset, used to rank and compare outputs from text-to-image generative models.
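
A minimal sketch of how such a scorer and its training signal might look, assuming the score is a temperature-scaled cosine similarity between a text embedding and an image embedding, and the loss is a two-way softmax cross-entropy over each preference pair. The exact loss and temperature are not stated on this page, so the function names, the `temperature` value, and the toy embeddings below are illustrative assumptions rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def preference_loss(text_emb, img_a_emb, img_b_emb, preferred, temperature=100.0):
    """Cross-entropy over a two-way softmax of scaled similarity scores.

    preferred is 0 if the annotator chose image A and 1 if they chose image B.
    temperature=100.0 is an assumed value, not one taken from the paper.
    """
    s_a = temperature * cosine(text_emb, img_a_emb)
    s_b = temperature * cosine(text_emb, img_b_emb)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(s_a, s_b)
    log_z = m + math.log(math.exp(s_a - m) + math.exp(s_b - m))
    chosen = s_a if preferred == 0 else s_b
    return log_z - chosen


# Toy embeddings: image A matches the text direction, image B does not.
text = [1.0, 0.0]
image_a = [1.0, 0.0]
image_b = [0.0, 1.0]
loss_when_a_preferred = preference_loss(text, image_a, image_b, preferred=0)
loss_when_b_preferred = preference_loss(text, image_a, image_b, preferred=1)
```

In a real fine-tuning run the embeddings would come from the CLIP encoders and the loss would be minimized with a gradient-based optimizer; this sketch only shows that the loss is small when the higher-scoring image is the one people chose.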

If this is right

  • Allows more reliable comparison of recent text-to-image models from academic, community, and industry sources via a shared benchmark.
  • Detects when algorithmic changes improve outputs in ways that match human taste rather than proxy scores.
  • Supports stable, fair, and easy-to-use evaluation by guiding the design of text prompts used during scoring.
  • Provides a dataset and model that can serve as a drop-in replacement for weaker automatic metrics in research pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers could close the loop by using HPS v2 as a training signal inside generative models instead of only for post-hoc evaluation.
  • The same preference-collection approach might transfer to related tasks such as text-to-video or image editing where human alignment is also hard to measure.
  • Widespread adoption could shift model development away from optimizing for FID or CLIP score toward outputs that survive direct human comparison.
  • Periodic retraining of the scorer on new preference data would be needed to keep pace with rapid changes in generative model capabilities.

Load-bearing premise

The collected human preferences are unbiased and representative enough that fine-tuning CLIP on them produces a scorer that continues to align with human judgments on future unseen models and image distributions.

What would settle it

Gather fresh human preference judgments on images from a new text-to-image model released after HPD v2 collection, then measure whether HPS v2 correlates more strongly with those judgments than earlier metrics such as CLIP score or FID.
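
One simple way to run that comparison is pairwise agreement: the fraction of image pairs on which a metric ranks the human-preferred image higher. The sketch below assumes each metric yields one scalar score per image; the function name and the example numbers are made up for illustration and are not results from the paper.

```python
def pairwise_agreement(scores_a, scores_b, human_choices):
    """Fraction of pairs where the metric agrees with the human choice.

    scores_a[i] and scores_b[i] are the metric's scores for the two images
    of pair i. human_choices[i] is 0 if people preferred image A, 1 for B.
    Ties in the metric are counted as disagreements.
    """
    if not (len(scores_a) == len(scores_b) == len(human_choices)):
        raise ValueError("all inputs must have the same length")
    if not human_choices:
        raise ValueError("need at least one pair")
    hits = 0
    for s_a, s_b, choice in zip(scores_a, scores_b, human_choices):
        if s_a == s_b:
            continue  # a tie cannot match either choice
        predicted = 0 if s_a > s_b else 1
        if predicted == choice:
            hits += 1
    return hits / len(human_choices)


# Illustrative numbers only: three pairs, the metric ties on the last one.
agreement = pairwise_agreement([0.9, 0.2, 0.5], [0.1, 0.8, 0.5], [0, 1, 0])
```

Computing this value for HPS v2 and for each baseline metric on the same fresh pairs would give directly comparable numbers.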

Original abstract

Recent text-to-image generative models can generate high-fidelity images from text inputs, but the quality of these generated images cannot be accurately evaluated by existing evaluation metrics. To address this issue, we introduce Human Preference Dataset v2 (HPD v2), a large-scale dataset that captures human preferences on images from a wide range of sources. HPD v2 comprises 798,090 human preference choices on 433,760 pairs of images, making it the largest dataset of its kind. The text prompts and images are deliberately collected to eliminate potential bias, which is a common issue in previous datasets. By fine-tuning CLIP on HPD v2, we obtain Human Preference Score v2 (HPS v2), a scoring model that can more accurately predict human preferences on generated images. Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models, making it a preferable evaluation metric for these models. We also investigate the design of the evaluation prompts for text-to-image generative models, to make the evaluation stable, fair and easy-to-use. Finally, we establish a benchmark for text-to-image generative models using HPS v2, which includes a set of recent text-to-image models from the academic, community and industry. The code and dataset is available at https://github.com/tgxs002/HPSv2 .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Human Preference Dataset v2 (HPD v2), comprising 798,090 human preference choices over 433,760 image pairs drawn from diverse text-to-image sources, with deliberate collection to reduce bias. Fine-tuning CLIP on HPD v2 yields Human Preference Score v2 (HPS v2), which the authors claim generalizes better than prior metrics (e.g., CLIP, Aesthetic Score) across image distributions and responds to algorithmic improvements in generative models. The work also examines prompt design for stable evaluation and releases a benchmark ranking recent T2I models from academia, community, and industry.

Significance. If the generalization and responsiveness claims hold under rigorous validation, HPS v2 would supply a human-aligned, practical metric that improves upon distribution-based scores like FID or uncalibrated CLIP similarity for T2I evaluation. The scale of HPD v2 and the public benchmark constitute a concrete resource for the field, provided the scorer's alignment persists on future model families.

major comments (3)
  1. [§4] §4 (Experiments on generalization): The central claim that HPS v2 'generalizes better than previous metrics across various image distributions' is supported only by comparisons on image sets drawn from the same pool of source models used to build HPD v2. No temporal or architectural hold-out is reported in which entire model families (e.g., post-2023 diffusion variants or novel architectures) are excluded from training data yet included in test distributions, leaving the responsiveness-to-improvements result vulnerable to distribution shift.
  2. [§3.2] §3.2 (HPS v2 training) and Table 2: The fine-tuning procedure is described at a high level, but the manuscript provides neither the exact loss formulation, learning-rate schedule, nor ablation on the number of negative pairs per prompt. Without these details it is impossible to assess whether the reported gains over baseline CLIP are due to the preference data itself or to hyper-parameter choices.
  3. [§5] §5 (Benchmark): The ranking of models is presented without error bars, inter-rater agreement statistics on the human labels, or a sensitivity analysis to prompt wording. This weakens the assertion that HPS v2 yields a 'stable, fair and easy-to-use' evaluation protocol.
minor comments (3)
  1. [Abstract / §2.1] The abstract states that HPD v2 'eliminates potential bias' but does not quantify residual prompt or demographic biases; a short paragraph in §2.1 citing the exact collection protocol would clarify this.
  2. [Figure 3] Figure 3 (qualitative examples) lacks axis labels and a legend indicating which images correspond to which model; this reduces readability.
  3. [§3.2] The GitHub link is given, but the manuscript does not specify the exact train/validation split sizes or the random seed used for fine-tuning, hindering reproducibility.
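
The architectural hold-out requested in major comment 1 could be sketched as a split keyed on model family rather than on individual images. The record layout (a `families` field naming the generator of each image in a pair) and the family names are hypothetical; the page does not describe how HPD v2 stores this information.

```python
def split_by_model_family(pairs, held_out_families):
    """Split preference pairs so held-out families never appear in training.

    Each pair is a dict with a "families" entry listing the generator family
    of both images (a hypothetical field). A pair goes to the test split if
    either image comes from a held-out family, so mixed pairs are test-only.
    """
    held_out = set(held_out_families)
    train, test = [], []
    for pair in pairs:
        if held_out.intersection(pair["families"]):
            test.append(pair)
        else:
            train.append(pair)
    return train, test


# Placeholder family names, not the models used in the paper.
example_pairs = [
    {"id": 1, "families": ("family_a", "family_b")},
    {"id": 2, "families": ("family_a", "family_new")},
    {"id": 3, "families": ("family_new", "family_new")},
]
train_pairs, test_pairs = split_by_model_family(example_pairs, ["family_new"])
```

Because the split is deterministic given the family list, it needs no random seed, which also sidesteps part of the reproducibility concern in minor comment 3.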

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the paper without altering its core claims.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments on generalization): The central claim that HPS v2 'generalizes better than previous metrics across various image distributions' is supported only by comparisons on image sets drawn from the same pool of source models used to build HPD v2. No temporal or architectural hold-out is reported in which entire model families (e.g., post-2023 diffusion variants or novel architectures) are excluded from training data yet included in test distributions, leaving the responsiveness-to-improvements result vulnerable to distribution shift.

    Authors: We appreciate this point on rigorous generalization testing. Our Section 4 evaluations do include image sets from diverse sources such as community fine-tunes and industry models (e.g., Midjourney v5, DALL·E variants) whose outputs were not part of HPD v2 training collection, and HPS v2 shows improved correlation with human preferences on these. However, we agree that explicit architectural and temporal hold-outs would further substantiate the claims. In the revised manuscript, we will add new experiments that exclude specific post-2023 model families from HPS v2 training data and evaluate responsiveness on held-out newer architectures, to be included in an expanded Section 4. revision: yes

  2. Referee: [§3.2] §3.2 (HPS v2 training) and Table 2: The fine-tuning procedure is described at a high level, but the manuscript provides neither the exact loss formulation, learning-rate schedule, nor ablation on the number of negative pairs per prompt. Without these details it is impossible to assess whether the reported gains over baseline CLIP are due to the preference data itself or to hyper-parameter choices.

    Authors: We agree that the training details in Section 3.2 are insufficient for full reproducibility and attribution of gains. The current description was kept high-level to focus on the dataset contribution, but this was an oversight. In the revised manuscript, we will expand Section 3.2 and update Table 2 to specify the exact loss (a contrastive pairwise ranking loss on preference pairs), the learning-rate schedule (AdamW with cosine decay, initial LR of 1e-5), and include an ablation on the number of negative pairs per prompt. These additions will demonstrate that performance improvements are driven by HPD v2 rather than hyper-parameters alone. revision: yes

  3. Referee: [§5] §5 (Benchmark): The ranking of models is presented without error bars, inter-rater agreement statistics on the human labels, or a sensitivity analysis to prompt wording. This weakens the assertion that HPS v2 yields a 'stable, fair and easy-to-use' evaluation protocol.

    Authors: Thank you for noting these omissions in the benchmark presentation. In the revised Section 5, we will add error bars to the model rankings using bootstrap resampling over evaluation prompts. We will also include a sensitivity analysis varying prompt wording (e.g., adding descriptors or rephrasing) to quantify stability of HPS v2 scores. For inter-rater agreement on the underlying human labels, our collection prioritized scale with single annotations per pair; we will explicitly discuss this as a limitation and note how the dataset size helps average out individual variance. revision: partial
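
The bootstrap error bars mentioned in response 3 could be computed along the following lines. This is a sketch assuming one benchmark score per evaluation prompt and a percentile interval; the function name, the number of resamples, and the sample scores are illustrative assumptions, not values from the paper.

```python
import random
import statistics


def bootstrap_mean_ci(per_prompt_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean score over prompts.

    Prompts are resampled with replacement; the interval covers the central
    (1 - alpha) share of the resampled means. Returns (mean, low, high).
    """
    if not per_prompt_scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(per_prompt_scores)
    means = []
    for _ in range(n_resamples):
        sample = [per_prompt_scores[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = max(lo_idx, int((1 - alpha / 2) * n_resamples) - 1)
    return statistics.fmean(per_prompt_scores), means[lo_idx], means[hi_idx]


# Made-up per-prompt scores for a single model.
example_scores = [0.21, 0.25, 0.28, 0.30, 0.27, 0.24]
mean_score, ci_low, ci_high = bootstrap_mean_ci(example_scores, n_resamples=200, seed=1)
```

Two models whose intervals overlap heavily would then be reported as not clearly separable, which speaks to the referee's concern about ranking stability.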

standing simulated objections not resolved
  • Inter-rater agreement statistics cannot be computed because the HPD v2 collection process used single annotations per image pair to achieve the reported scale of 798k choices.

Circularity Check

0 steps flagged

No significant circularity; empirical training and held-out testing are independent

full rationale

The paper explicitly collects HPD v2 human preference data, fine-tunes CLIP to produce HPS v2, and then reports generalization results on various image distributions. This is standard supervised learning with no self-definitional loop, no fitted parameter renamed as a prediction, and no load-bearing self-citation that reduces the central claim to its own inputs. The generalization experiments are presented as tests on independent distributions rather than tautological outputs of the training process.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pairwise human preference data can be used to fine-tune a vision-language model into a generalizable scorer, plus standard machine-learning assumptions about generalization from training data.

axioms (1)
  • domain assumption: Human preferences over image pairs can be effectively captured and generalized by fine-tuning a pre-trained vision-language model such as CLIP on a large collected dataset.
    Invoked when the paper states that fine-tuning CLIP on HPD v2 yields a scoring model that predicts human preferences.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.LawOfExistence defect_zero_iff_one
    Relation between the paper passage and the cited Recognition theorem: unclear

    Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

    cs.CV 2026-04 unverdicted novelty 8.0

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  2. RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

    cs.CV 2026-05 conditional novelty 7.0

    RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.

  3. Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.

  4. CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.

  5. TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

    cs.CV 2026-05 conditional novelty 7.0

    TASTE supplies designer ratings across nine criteria for outputs from four text-to-image models, with statistical tests showing moderate agreement and benchmarks where existing scorers reach at most 0.55 macro agreeme...

  6. Probability-Conserving Flow Guidance

    cs.CV 2026-05 unverdicted novelty 7.0

    AdaMaG is a guidance rule for generative models derived from decomposing continuity-equation effects into divergence and score-parallel terms, with a proof that divergence diverges near the manifold and a time-depende...

  7. AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

    cs.AI 2026-05 unverdicted novelty 7.0

    AutoRubric-T2I learns a small set of interpretable rubrics for VLM judges that outperform scalar reward models on T2I benchmarks while using far less preference data.

  8. AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

    cs.AI 2026-05 unverdicted novelty 7.0

    AutoRubric-T2I learns and selects explicit rubrics from preference pairs to guide VLM judges, producing high-quality interpretable rewards for T2I alignment with far less data than traditional Bradley-Terry models.

  9. SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability

    cs.CV 2026-05 conditional novelty 7.0

    SeamCam quantifies camouflage by computing one minus the highest IoU recoverable from category-conditioned detection proposals against a ground-truth mask, achieving 78.82% agreement with human judgments.

  10. Pareto-Guided Optimal Transport for Multi-Reward Alignment

    cs.CV 2026-05 unverdicted novelty 7.0

    PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment while introducing JDR and JCR metrics to measure synergy and hacking.

  11. Asymmetric Flow Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...

  12. STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.

  13. Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Reinforce Adjoint Matching derives a simple consistency loss for RL post-training of diffusion models by tilting the clean distribution toward higher-reward samples under KL regularization while keeping the noising pr...

  14. ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.

  15. Attention Sinks in Diffusion Transformers: A Causal Analysis

    cs.CV 2026-05 unverdicted novelty 7.0

    Suppressing attention sinks in diffusion transformers does not degrade text-image alignment or most preference metrics, revealing a dissociation between generation trajectory changes and semantic output quality.

  16. TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.

  17. TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    TMPO replaces scalar reward maximization with trajectory-level matching to a Boltzmann distribution via Softmax-TB, improving generative diversity by 9.1% while keeping competitive reward performance.

  18. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 conditional novelty 7.0

    Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...

  19. LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling

    cs.CV 2026-05 unverdicted novelty 7.0

    LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.

  20. Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.

  21. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...

  22. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 7.0

    FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.

  23. Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

  24. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  25. Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.

  26. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  27. Depth Adaptive Efficient Visual Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 7.0

    DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

  28. Comparison Drives Preference: Reference-Aware Modeling for AI-Generated Video Quality Assessment

    cs.CV 2026-04 unverdicted novelty 7.0

    RefVQA uses a query-centered reference graph and graph-guided difference aggregation to improve AI-generated video quality assessment by incorporating inter-video comparisons.

  29. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  30. OneHOI: Unifying Human-Object Interaction Generation and Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.

  31. SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

    cs.LG 2026-04 unverdicted novelty 7.0

    SOAR is a reward-free on-policy method that supplies dense per-timestep supervision to correct exposure bias in diffusion model denoising trajectories, raising GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over ...

  32. RewardFlow: Generate Images by Optimizing What You Reward

    cs.CV 2026-04 unverdicted novelty 7.0

    RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.

  33. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  34. Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling

    cs.LG 2026-04 unverdicted novelty 7.0

    HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.

  35. 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation

    cs.CV 2026-04 conditional novelty 7.0

    1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.

  36. SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis

    cs.CV 2026-03 conditional novelty 7.0

    SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.

  37. Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

    cs.CV 2026-03 unverdicted novelty 7.0

    SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

  38. Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching

    cs.CV 2026-02 unverdicted novelty 7.0

    Stroke of Surprise is a framework that generates vector sketches undergoing semantic transformation from one concept to another by adding strokes, using dual-branch SDS and overlay loss for optimization.

  39. Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

    cs.CV 2026-02 unverdicted novelty 7.0

    DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.

  40. Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution

    cs.CV 2026-02 unverdicted novelty 7.0

    Tiled Prompts generates tile-specific text prompts for each latent tile in diffusion super-resolution to reduce errors from global prompts and improve perceptual quality.

  41. Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

    cs.CV 2026-01 unverdicted novelty 7.0

    LocalDPO creates localized preference pairs from real videos by applying random spatio-temporal masks and restoring masked regions with the frozen base model, then applies region-restricted DPO loss to improve fidelit...

  42. Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

    cs.CV 2026-01 unverdicted novelty 7.0

    LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.

  43. It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models

    cs.CV 2025-12 unverdicted novelty 7.0

    Noise optimization during sampling recovers diversity in mode-collapsed diffusion models while preserving output fidelity.

  44. Determinism of Randomness: Prompt-Residual Seed Shaping for Diffusion Generation

    cs.CV 2025-11 unverdicted novelty 7.0

    A geometric view of semantic anisotropy in diffusion latents motivates a prompt-residual seed-shaping method that improves prompt alignment and visual quality without training.

  45. MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    cs.AI 2025-07 unverdicted novelty 7.0

    MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.

  46. WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

  47. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  48. T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts

    cs.CV 2024-12 unverdicted novelty 7.0

    T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.

  49. SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.

  50. Hierarchical Variational Policies for Reward-Guided Diffusion

    cs.LG 2026-05 conditional novelty 6.0

    A hierarchical variational formulation amortizes test-time guidance in diffusion models to achieve strong quality-speed tradeoffs with significantly reduced inference compute.

  51. Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics

    cs.CV 2026-05 unverdicted novelty 6.0

    A feature supervision approach using SigLIP 2 extracts multi-granularity vision-aligned text representations to supervise MM-DiT image branches, pushing the Pareto frontier for portrait generation across alignment, re...

  52. Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection

    cs.CV 2026-05 unverdicted novelty 6.0

    ABSS ranks diffusion seeds by early cross-attention strength to prompt core tokens and retains only the top-k for full generation, yielding consistent gains in alignment and quality on Stable Diffusion variants.

  53. Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?

    cs.CV 2026-05 unverdicted novelty 6.0

    AdaScope adaptively selects optimal RL intervention points during diffusion denoising by monitoring structural and semantic changes, delivering 66% higher performance at 59% lower cost than full-trajectory RL baselines.

  54. ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

    cs.CV 2026-05 unverdicted novelty 6.0

    ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency p...

  55. HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    HeatKV ranks attention heads by their focus on prior scales using offline calibration data and applies a static per-head pruning schedule, delivering 2x higher KV-cache compression than prior methods on the Infinity-2...

  56. Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.

  57. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV 2026-05 unverdicted novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  58. EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    EPIC introduces predicate-guided inference-time search that lifts compositional T2I prompt accuracy from 34% to 71% on GenEval2 with 31-81% lower execution costs.

  59. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  60. Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Derives RAM, a reward-adjusted consistency loss extending diffusion pretraining regression to efficient KL-regularized RL post-training, achieving peak rewards up to 50x faster than Flow-GRPO on Stable Diffusion 3.5M.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 103 Pith papers · 6 internal anchors

  1. [1]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015

  2. [2]

    Cogview2: Faster and better text-to-image generation via hierarchical transformers

    Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. NeurIPS, 35:16890–16902, 2022

  3. [3]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, pages 12873–12883, 2021

  4. [4]

    Vector quantized diffusion model for text-to-image synthesis

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, pages 10696–10706, 2022

  5. [5]

    Detoxify

    Laura Hanu and Unitary team. Detoxify. GitHub. https://github.com/unitaryai/detoxify, 2020

  6. [6]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022

  7. [7]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS, 2017

  8. [8]

    Openclip, July 2021

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021

  9. [9]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  10. [10]

    Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation. arXiv preprint arXiv:2305.01569, 2023

  11. [11]

    Aligning Text-to-Image Models using Human Feedback

    Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023

  12. [12]

    AGIQA-3K: An Open Database for AI-Generated Image Quality Assessment, 2023

    Chunyi Li, Zicheng Zhang, Haoning Wu, Wei Sun, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, and Weisi Lin. AGIQA-3K: An Open Database for AI-Generated Image Quality Assessment, 2023

  13. [13]

    Fusedream: Training-free text-to-image generation with improved CLIP+GAN space optimization

    Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, and Qiang Liu. Fusedream: Training-free text-to-image generation with improved CLIP+GAN space optimization. arXiv preprint arXiv:2112.01573, 2021

  14. [14]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  15. [15]

    Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images, 2023

    Zeyu Lu, Di Huang, Lei Bai, Jingjing Qu, Chengyue Wu, Xihui Liu, and Wanli Ouyang. Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images, 2023

  16. [16]

    AVA: A large-scale database for aesthetic visual analysis

    Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. CVPR, pages 2408–2415, 2012

  17. [17]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In ICML, 2021

  18. [18]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NeurIPS, 35:27730–27744, 2022

  19. [19]

    Simulacra Aesthetic Captions

    John David Pressman, Katherine Crowson, and Simulacra Captions Contributors. Simulacra Aesthetic Captions. Technical Report Version 1.0, Stability AI, 2022

  20. [20]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021

  21. [21]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. ArXiv, abs/2204.06125, 2022

  22. [22]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. ArXiv, abs/2102.12092, 2021

  23. [23]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR, pages 10674–10685, 2022

  24. [24]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 35:36479–36494, 2022

  25. [25]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016

  26. [26]

    Generating images of rare concepts using pre-trained diffusion models, 2023

    Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, and Gal Chechik. It is all about where you start: Text-to-image generation with seed selection. arXiv preprint arXiv:2304.14530, 2023

  27. [27]

    StyleGAN-T: Unlocking the power of gans for fast large-scale text-to-image synthesis

    Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. StyleGAN-T: Unlocking the power of gans for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515, 2023

  28. [28]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022

  29. [29]

    LAION-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text ...

  30. [30]

    Proper reuse of image classification features improves object detection

    Cristina Vasconcelos, Vighnesh Birodkar, and Vincent Dumoulin. Proper reuse of image classification features improves object detection. In CVPR, pages 13628–13637, 2022

  31. [31]

    DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models

    Zijie J Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models. arXiv preprint arXiv:2210.14896, 2022

  32. [32]

    Better Aligning Text-to-Image Models with Human Preference, 2023

    Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better Aligning Text-to-Image Models with Human Preference, 2023

  33. [33]

    ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, 2023

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, 2023

  34. [34]

    Versatile diffusion: Text, images and variations all in one diffusion model

    Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile Diffusion: Text, Images and Variations All in One Diffusion Model. arXiv preprint arXiv:2211.08332, 2022

  35. [35]

    LiT: Zero-Shot Transfer With Locked-Image Text Tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-Shot Transfer With Locked-Image Text Tuning. In CVPR, pages 18123–18133, June 2022

  36. [36]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018

  37. [37]

    A Perceptual Quality Assessment Exploration for AIGC Images, 2023

    Zicheng Zhang, Chunyi Li, Wei Sun, Xiaohong Liu, Xiongkuo Min, and Guangtao Zhai. A Perceptual Quality Assessment Exploration for AIGC Images, 2023

  38. [38]

    Hype: A benchmark for human eye perceptual evaluation of generative models

    Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li F Fei-Fei, and Michael Bernstein. Hype: A benchmark for human eye perceptual evaluation of generative models. NeurIPS, 32, 2019

  39. [39]

    Lafite: Towards language-free training for text-to-image generation

    Y Zhou, R Zhang, C Chen, C Li, C Tensmeyer, T Yu, J Gu, J Xu, and T Sun. LAFITE: Towards Language-Free Training for Text-to-Image Generation. arXiv preprint arXiv:2111.13792, 2021

  40. [40]

    For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] (c) Did you discuss any potential negative societal impacts of your work? [N/A] (d) Have you read the ethics review guidelines and ensured that your paper con...

  41. [41]

    (a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A]

    If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A]

  42. [42]

    If you ran experiments (e.g. for benchmarks)...

    If you ran experiments (e.g. for benchmarks)... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Please see the supplemental material. (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes...

  43. [43]

    If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] (b) Did you mention the license of the assets? [N/A] (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We will release our dataset and pre-train mode...

  44. [44]

    If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] We will show our instructions given to the workers in the supplemental material. (b) Did you describe any potential participant risks, with links to Institutional Review Boar...

  45. [45]

    paintings

    If the picture belongs to the style of “paintings”, reply only with “paintings”

  46. [46]

    anime and cartoon

    If the picture belongs to the style of “anime and cartoon”, reply only with “anime and cartoon”

  47. [47]

    real photo

    If the picture belongs to the style of “real photo”, reply only with “real photo”

  48. [48]

    concept-art

    If the picture belongs to the style of “concept-art”, reply only with “concept-art”

  49. [49]

    others

    If the picture doesn’t belong to any styles of above, reply only with “others”; You must reply with only on word. Even though prompts of “Photo” category in HPD v2 are from COCO Captions [1], we retain “Photo” in the classification process to mitigate the potential mistakes made by ChatGPT. The category distribution of HPD v2 is illustrated in Fig. 7. Add...

  50. [50]

    prompt, Image (A) should take precedence over Image (B)

    When Image (A) surpasses Image (B) in terms of aesthetic appeal and fidelity, or Image (B) suffers from severe distortion and blurriness, even if Image (B) aligns better with the prompt, Image (A) should take precedence over Image (B). For example, in Fig. 8 (prompt: “A pair of skis standing up against a gate.”), Fig. 8(b) lacks clear out...

  51. [51]

    For example, if you cannot make a choice based on personal preference, as in Fig

    When facing a dilemma that images are relatively similar in terms of aesthetics and personal preference, please carefully read and consider the prompt for sorting based more on the text-image alignment. For example, if you cannot make a choice based on personal preference, as in Fig. 10, please pay attention to the description, which refers to a mouse me...

  52. [52]

    animation

    It is crucial to pay special attention to the capitalized names, as these names may lead to misunderstandings during the machine translation process. If there is any incorrectly translated proprietary term or content you are not familiar with, we recommend you to search for sample images and explanations online. Figure 10: Prompt: A ginger hair...