pith. sign in

arxiv: 2605.20731 · v1 · pith:RKSCC552new · submitted 2026-05-20 · 💻 cs.CV · cs.AI· stat.AP

TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

Pith reviewed 2026-05-21 05:05 UTC · model grok-4.3

classification 💻 cs.CV cs.AIstat.AP
keywords graphic designpreference datasettext-to-image evaluationdesigner annotationsmulti-dimensional scoringAI preference learningvisual quality assessment
0
0 comments X

The pith

A small pairwise-difference head trained on designer ratings for graphic design reaches 0.611 macro agreement with the five-designer majority.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases TASTE, a dataset of ten professional designers ranking AI-generated graphic designs from four text-to-image models along nine criteria such as typography, layout, and color harmony. Existing pre-trained scorers, including large vision-language models and dedicated T2I judges, top out at 0.55 agreement with the five-designer majority vote. Training a lightweight pairwise-difference head on the new data lifts performance to 0.611, which covers roughly half the remaining distance to the 0.741 agreement level achieved by any single designer against the same majority.

Core claim

Ten designers supplied 1,600 ratings per criterion plus hallucination flags on two cohorts of images. A criterion-agnostic signal test using Kendall's tau, majority probability, and Condorcet cycles shows designer agreement sits between food preferences and photo-style image quality. No benchmarked pre-trained system exceeds 0.55 macro agreement with the five-designer majority, while a small head trained directly on TASTE reaches 0.611 and thereby closes half the gap to the single-rater ceiling of 0.741.

What carries the argument

The TASTE dataset of multi-criterion designer rankings paired with a small pairwise-difference head that predicts relative preference between image pairs.

If this is right

  • Text-to-image systems can be fine-tuned or guided using separate scores for typography, visual hierarchy, and brief fidelity rather than a single overall preference label.
  • The same signal-test framework can quantify agreement levels when the same method is applied to other creative domains such as UI design or product photography.
  • Future model releases can be ranked by how closely their outputs match the nine separate axes instead of by aggregate human preference.
  • The per-image hallucination flags collected on the holistic cohorts provide a direct way to measure and penalize factual or semantic errors in generated layouts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the multi-axis scores prove additive, an AI design tool could let users dial typography quality independently of color harmony without retraining the generator.
  • The dataset's scale of 1,600 ratings per criterion suggests it could serve as a seed for active-learning loops that keep improving the head as more designer labels arrive.
  • Extending the same annotation protocol to video or 3-D outputs would test whether the current agreement levels and head performance generalize beyond static graphic design.

Load-bearing premise

The five-designer majority vote on the holistic-preference cohorts serves as a stable and representative target for both measuring and training automated scorers.

What would settle it

Retraining the pairwise-difference head on TASTE and evaluating it on a fresh set of designs from the same four models yields macro agreement no higher than 0.55 with the held-out five-designer majority.

Figures

Figures reproduced from arXiv: 2605.20731 by Alexandria Minetti, Allison Nulty, Elad Hirsch, Haonan Zhu, Purvanshi Mehta.

Figure 1
Figure 1. Figure 1: Per-prompt T distribution (the mean pairwise Kendall τ across the 10 evaluator pairs) for each TASTE sub-dimension, with horizontal lines showing the median T of the three cross-domain reference anchors. Aesthetics sub-dimensions are shown in blue and Descriptions in green. All nine sub-dimensions sit close to the Sushi (food) and MovieLens (movies) medians, well below the median of HPSv2-test restricted t… view at source ↗
Figure 2
Figure 2. Figure 2: Per-sub-dimension Condorcet cycle rate. All nine sub-dimensions sit at or below [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Per-sub-dimension mean pairwise τ, sorted descending, color-coded by cohort. Description-fidelity sub-dimensions (green) dominate the top of the ordering; aesthetic sub￾dimensions (blue) dominate the bottom. Reference anchor mean τs are shown as dashed lines. Sushi anchor; the five aesthetic criteria cluster below it, with Color Harmony at the bot￾tom. This ordering is consistent with the observation that … view at source ↗
Figure 4
Figure 4. Figure 4: Per-evaluator agreeableness on the Aesthetics (left) and Descriptions (right) co [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
read the original abstract

Text-to-image models produce graphic design at production scale, but their supervision comes from photo-style preference data with a single overall verdict per comparison. Designers evaluate along several distinct axes, including typography, visual hierarchy, color harmony, layout, and brief fidelity, and a single label collapses them. We release TASTE (Typography, Aesthetics, Spatial, Tone, Etc.): ten professional designers ranked outputs from four current text-to-image models on nine criteria across two disjoint cohorts, yielding 1,600 ratings per criterion plus per-image hallucination flags on the holistic-preference cohorts. We pair the dataset with three contributions. First, a criterion-agnostic signal test framework, using Kendall's tau, majority probability, and Condorcet cycles against exact iid-uniform nulls at p = 4 and R = 5, places designer agreement on graphic design between food and movie preferences and photo-style image quality, with every TASTE criterion rejecting the random-rater null. Second, no pre-trained system in our benchmark, including six open-weight VLM judges from 3B to 33B parameters and three dedicated T2I scorers, HPSv2.1, PickScore-v1, and LAION-Aesthetic-V2, exceeds 0.55 macro agreement with the 5-designer majority; VLM judges trade off position bias against content sensitivity, so scaling moves along this frontier without improving accuracy. Third, a small pairwise-difference head trained on TASTE reaches 0.611, closing roughly half the gap to the 0.741 single-rater ceiling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the TASTE dataset, consisting of rankings by ten professional designers on outputs from four text-to-image models across nine criteria (typography, aesthetics, spatial, tone, etc.) in two cohorts, yielding 1,600 ratings per criterion plus hallucination flags. It presents a criterion-agnostic statistical framework using Kendall's tau, majority probability, and Condorcet cycles against iid-uniform nulls, benchmarks existing VLMs and T2I scorers (none exceeding 0.55 macro agreement with the 5-designer majority), and reports a small pairwise-difference head trained on TASTE achieving 0.611 macro agreement, closing roughly half the gap to the 0.741 single-rater ceiling.

Significance. If the empirical measurements hold, TASTE supplies a much-needed multi-axis preference resource for graphic design that goes beyond single-verdict photo-style data, while the benchmark and new head establish concrete baselines showing current systems fall short and that modest task-specific training can improve alignment with designer judgments on axes such as layout and brief fidelity.

major comments (2)
  1. [Abstract] Abstract: the headline claim that the pairwise-difference head at 0.611 'closes roughly half the gap' to the 0.741 single-rater ceiling treats the 5-designer majority as a fixed, stable target. The single-rater ceiling already encodes substantial disagreement; without leave-one-out, bootstrap, or per-cohort variance statistics on majority stability, it remains unclear whether the reported improvement is robust or partly an artifact of label noise in the aggregate.
  2. [Abstract] Abstract and statistical framework section: the signal-test results (every criterion rejects the random-rater null at p=4, R=5) are presented as placing graphic-design agreement between food/movie preferences and photo-style quality, yet the exact data-exclusion rules, inter-rater reliability formula, and handling of ties or missing ratings are not stated, preventing independent verification of the reported agreement numbers.
minor comments (2)
  1. [Abstract] Abstract: the phrase '1,600 ratings per criterion' should be clarified as total across designers or per cohort to avoid ambiguity.
  2. [Benchmark] Benchmark section: the architecture, training hyperparameters, and exact loss for the 0.611 head should be reported in more detail to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the robustness and reproducibility of our claims regarding the TASTE dataset and its associated benchmarks. We address each point below and have incorporated revisions to strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that the pairwise-difference head at 0.611 'closes roughly half the gap' to the 0.741 single-rater ceiling treats the 5-designer majority as a fixed, stable target. The single-rater ceiling already encodes substantial disagreement; without leave-one-out, bootstrap, or per-cohort variance statistics on majority stability, it remains unclear whether the reported improvement is robust or partly an artifact of label noise in the aggregate.

    Authors: We agree that quantifying the stability of the 5-designer majority is necessary to substantiate the gap-closure claim. In the revised manuscript we add leave-one-out analyses across the ten designers and bootstrap resampling (1,000 iterations) of the majority labels, both per criterion and per cohort. These show that the macro agreement of 0.611 remains within the reported range relative to the single-rater ceiling of 0.741, with the improvement holding after accounting for label variance. The abstract has been updated to reference these additional statistics. revision: yes

  2. Referee: [Abstract] Abstract and statistical framework section: the signal-test results (every criterion rejects the random-rater null at p=4, R=5) are presented as placing graphic-design agreement between food/movie preferences and photo-style quality, yet the exact data-exclusion rules, inter-rater reliability formula, and handling of ties or missing ratings are not stated, preventing independent verification of the reported agreement numbers.

    Authors: We acknowledge the omission of these implementation details. The statistical framework section has been expanded to specify: (i) exclusion criteria (images with fewer than three complete rankings or unresolved ties after the tie-breaking protocol are removed), (ii) the inter-rater reliability computation as the average pairwise Kendall's tau with the standard tie correction, and (iii) confirmation that the collected data contain no missing ratings. We also include pseudocode for the exact null-model simulation in the appendix to enable direct replication of the p=4, R=5 rejection results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset release and direct measurements on new annotations

full rationale

The paper collects fresh multi-criteria rankings from ten designers on outputs from four T2I models, yielding 1,600 ratings per criterion plus hallucination flags. It computes agreement statistics (Kendall's tau, majority probability, Condorcet cycles) against exact iid-uniform nulls, benchmarks six VLMs and three dedicated scorers against the 5-designer majority on the holistic cohorts, and trains a small pairwise-difference head on TASTE to reach 0.611 macro agreement. No equations, self-definitional loops, or fitted-parameter renamings appear; the reported figures are direct comparisons to the collected labels and standard supervised training on those labels. The derivation chain consists of data collection followed by independent empirical evaluation and model fitting, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard statistical tests for ranking agreement and on the assumption that professional designer judgments constitute useful supervision; no free parameters or new invented entities are introduced.

axioms (1)
  • standard math Kendall's tau, majority probability, and Condorcet cycle detection against exact iid-uniform nulls at p=4 and R=5 constitute a valid criterion-agnostic test for signal in preference data.
    Invoked to place designer agreement relative to random and to other domains.

pith-pipeline@v0.9.0 · 5835 in / 1452 out tokens · 41148 ms · 2026-05-21T05:05:50.480335+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 4 internal anchors

  1. [1]

    E-comIQ-ZH: A human-aligned dataset and benchmark for fine- grained evaluation of E-commerce posters with chain-of-thought

    Anonymous et al. E-comIQ-ZH: A human-aligned dataset and benchmark for fine- grained evaluation of E-commerce posters with chain-of-thought. arXiv preprint arXiv:2602.21698, 2026

  2. [2]

    FLUX: Open-source text-to-image generation models

    Black Forest Labs. FLUX: Open-source text-to-image generation models. Technical report,https://blackforestlabs.ai, 2024

  3. [3]

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

  4. [4]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

  5. [5]

    Learning vi- sual importance for graphic designs and data visualizations

    Zoya Bylinskii, Nam Wook Kim, Peter O’Donovan, Sami Alsheikh, Spandan Madan, Hanspeter Pfister, Fredo Durand, Bryan Russell, and Aaron Hertzmann. Learning vi- sual importance for graphic designs and data visualizations. InProceedings of the 30th Annual ACM Symposium on User Interface Software and Technology (UIST), 2017

  6. [6]

    Seedream: Native high-resolution bilingual image generation foundation model

    ByteDance. Seedream: Native high-resolution bilingual image generation foundation model. Technical report, ByteDance, 2024. 20ZHU ET AL.: TASTE: DESIGNER-ANNOTA TED PREFERENCE FOR AI DESIGN

  7. [7]

    Exploring the naturalness of AI- generated images.IEEE Transactions on Circuits and Systems for Video Technology, 2024

    Zijian Chen, Wei Sun, Yuan Tian, Jun Jia, Zicheng Zhang, Jiarui Wang, Ru Huang, Xiongkuo Min, Guangtao Zhai, and Wenjun Zhang. Exploring the naturalness of AI- generated images.IEEE Transactions on Circuits and Systems for Video Technology, 2024

  8. [8]

    Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks

    Adrienne Deganutti, Elad Hirsch, Haonan Zhu, Jaejung Seol, and Purvanshi Mehta. Graphic-Design-Bench: A comprehensive benchmark for evaluating AI on graphic de- sign tasks. arXiv preprint arXiv:2604.04192, 2026

  9. [9]

    Predicting visual importance across graphic design types

    Camilo Fosco, Vincent Casser, Amish Kumar Bedi, Peter O’Donovan, Aaron Hertz- mann, and Zoya Bylinskii. Predicting visual importance across graphic design types. arXiv preprint arXiv:2008.02912, 2020

  10. [10]

    I-HallA: Evaluating image hallucination in text-to-image generation with question answering

    Sang gil Lim, Heesoo Jung, Choonghan Kim, Hyunwoo Park, Hwanhee Lee, and Pil- sung Kang. I-HallA: Evaluating image hallucination in text-to-image generation with question answering. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2025

  11. [11]

    Imagen and the Nano-Banana image generator

    Google DeepMind. Imagen and the Nano-Banana image generator. Technical report, Google DeepMind, 2024

  12. [12]

    DistortBench: Benchmarking Vision Language Models on Image Distortion Identification

    Divyanshu Goyal, Akhil Eppa, and Vanya Bannihatti Kumar. DistortBench: Bench- marking vision language models on image distortion identification.arXiv preprint arXiv:2604.19966, 2026

  13. [13]

    Maxwell Harper and Joseph A

    F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context.ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19:1–19:19, 2015

  14. [14]

    LICA: Lay- ered image composition annotations for graphic design research

    Elad Hirsch, Shubham Yadav, Mohit Garg, and Purvanshi Mehta. LICA: Lay- ered image composition annotations for graphic design research. arXiv preprint arXiv:2603.16098, 2026

  15. [15]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogniti...

  16. [16]

    Nantonac collaborative filtering: Recommendation based on order responses

    Toshihiro Kamishima. Nantonac collaborative filtering: Recommendation based on order responses. Technical report, ACM SIGKDD, 2003. Sushi preference dataset

  17. [17]

    Maurice G. Kendall. A new measure of rank correlation.Biometrika, 30(1/2):81–93, 1938

  18. [18]

    Pick-a-Pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  19. [19]

    Reliability in content analysis: Some common misconceptions and recommendations.Human Communication Research, 30(3):411–433, 2004

    Klaus Krippendorff. Reliability in content analysis: Some common misconceptions and recommendations.Human Communication Research, 30(3):411–433, 2004. ZHU ET AL.: TASTE: DESIGNER-ANNOTA TED PREFERENCE FOR AI DESIGN21

  20. [20]

    SAGE, 4th edition, 2018

    Klaus Krippendorff.Content Analysis: An Introduction to Its Methodology. SAGE, 4th edition, 2018

  21. [21]

    GenAI-Bench: Evaluating and improving compositional text-to-visual generation

    Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, and Deva Ramanan. GenAI-Bench: Evaluating and improving compositional text-to-visual generation. InAdvances in Neu- ral Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024

  22. [22]

    AGIQA-3K: An open database for AI-generated image quality assessment.IEEE Transactions on Circuits and Systems for Video Technology, 2023

    Chunyi Li, Zicheng Zhang, Haoning Wu, Wei Sun, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, and Weisi Lin. AGIQA-3K: An open database for AI-generated image quality assessment.IEEE Transactions on Circuits and Systems for Video Technology, 2023

  23. [23]

    AIGIQA-20K: A large database for AI-generated image quality assessment

    Chunyi Li, Tengchuan Kou, Yixuan Gao, Yuqin Cao, Wei Sun, Zicheng Zhang, Yingjie Zhou, Zhichao Zhang, Weixia Zhang, Haoning Wu, Xiaohong Liu, Xiongkuo Min, and Guangtao Zhai. AIGIQA-20K: A large database for AI-generated image quality assessment. InCVPR Workshops (NTIRE), 2024

  24. [24]

    Qwen3- vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multi- modal retrieval and ranking.arXiv, 2026

    Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3- vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multi- modal retrieval and ranking.arXiv, 2026

  25. [25]

    Rich human feedback for text-to-image generation

    Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Car- olan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, et al. Rich human feedback for text-to-image generation. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2024

  26. [26]

    A survey of multimodal hallucination evaluation and detection

    Hao Liu et al. A survey of multimodal hallucination evaluation and detection. arXiv preprint arXiv:2507.19024, 2025

  27. [27]

    Duncan Luce.Individual Choice Behavior: A Theoretical Analysis

    R. Duncan Luce.Individual Choice Behavior: A Theoretical Analysis. John Wiley & Sons, 1959

  28. [28]

    HPSv3: Towards wide- spectrum human preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, Hongsheng Li, et al. HPSv3: Towards wide- spectrum human preference score. InProceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), 2025

  29. [29]

    C. L. Mallows. Non-null ranking models.Biometrika, 44(1/2):114–130, 1957

  30. [30]

    GPT-Image and DALL·E 3: Text-to-image generation

    OpenAI. GPT-Image and DALL·E 3: Text-to-image generation. Technical report, OpenAI, 2024

  31. [31]

    Bigham, and Jason Wu

    Yi-Hao Peng, Jeffrey P. Bigham, and Jason Wu. DesignPref: Capturing personal pref- erences in visual design generation. arXiv preprint arXiv:2511.20513, 2025

  32. [32]

    Plackett

    Robin L. Plackett. The analysis of permutations.Journal of the Royal Statistical Soci- ety: Series C, 24(2):193–202, 1975

  33. [33]

    Manning, Stefano Er- mon, and Chelsea Finn

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Er- mon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 22ZHU ET AL.: TASTE: DESIGNER-ANNOTA TED PREFERENCE FOR AI DESIGN

  34. [34]

    ImagenWorld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks

    Tao Sun et al. ImagenWorld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks. arXiv preprint arXiv:2603.27862, 2026

  35. [35]

    Identifying and mitigating position bias of multi-image vision-language models

    Yu Tian, Tianqi Liu, Zhiyuan Liu, Jie Yang, and Cordelia Schmid. Identifying and mitigating position bias of multi-image vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2503.13792

  36. [36]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hong- sheng Li. Human Preference Score v2: A solid benchmark for evaluating human pref- erences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023

  37. [37]

    ImageReward: Learning and evaluating human preferences for text-to- image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to- image generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  38. [38]

    VisionReward: Fine-grained multi- dimensional human preference learning for image and video generation

    Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qinkai Li, Mingyi Zhang, et al. VisionReward: Fine-grained multi- dimensional human preference learning for image and video generation. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026

  39. [39]

    Learning multi-dimensional human preference for text-to-image generation

    Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), 2024

  40. [40]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Wang, Jingwen He, Fan Zhang, Yuanhan Zhang, Jingkang Yang, Yu Qiao, and Ziwei Liu. VBench-2.0: Advancing video gen- eration benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025

  41. [41]

    Xing, Hao Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonza- lez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Bench- marks Track, 2023

  42. [42]

    From fragment to one piece: A survey on AI-driven graphic design

    Xingxing Zou, Wen Zhang, and Nanxuan Zhao. From fragment to one piece: A survey on AI-driven graphic design. arXiv preprint arXiv:2503.18641, 2025. A Statistical definitions This appendix collects the formal definitions, support sets, and null PMFs of the three signal- test statistics introduced in §4. Notation: a sample is one prompt rated byR=5 evaluato...