pith. machine review for the scientific record.

arxiv: 2605.12684 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.AI · cs.HC

Recognition: unknown

Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.HC
keywords visual aesthetics · multimodal large language models · aesthetic benchmark · comparative judgment · image selection · expert evaluation

The pith

Frontier multimodal models correctly pick both the best and worst image in only 26.5 percent of controlled aesthetic tasks, while human experts reach 68.9 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether multimodal large language models can make reliable aesthetic judgments by comparing sets of images that share the same subject. It shows that assigning scalar scores to single images aligns poorly with direct comparisons, whereas set-based ranking produces clearer expert agreement. To measure model performance, the authors built the Visual Aesthetic Benchmark with 400 tasks and expert consensus labels across art, photography, and illustration. Even the strongest models succeed at identifying both the top and bottom images consistently across order permutations in just over a quarter of cases. Fine-tuning a smaller model on expert examples narrows the gap toward larger systems, indicating the comparative signal is learnable.

Core claim

The Visual Aesthetic Benchmark shows that frontier MLLMs reach only 26.5 percent accuracy at correctly naming both the best and worst image across three random orderings in 400 tasks, far below the 68.9 percent rate achieved by human experts, and that score-derived rankings match direct comparisons poorly.
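The scoring rule behind these numbers can be made concrete. A minimal sketch, assuming a hypothetical `predict` callback and task dictionaries (the paper's actual evaluation harness is not reproduced here): a task counts as solved only if the model names both the expert-consensus best and worst image under every random ordering of the candidates.

```python
import random

def task_correct(predict, images, best, worst, n_perms=3, seed=0):
    """A task counts as correct only if the model names both the
    expert-consensus best and worst image under every one of the
    n_perms random orderings of the candidates (three in the paper).
    `predict` is a hypothetical model callback: order -> (best, worst)."""
    rng = random.Random(seed)
    for _ in range(n_perms):
        order = images[:]
        rng.shuffle(order)
        pred_best, pred_worst = predict(order)
        if pred_best != best or pred_worst != worst:
            return False
    return True

def benchmark_accuracy(predict, tasks):
    """Fraction of tasks passed under the permutation-consistency rule;
    the paper reports 26.5% for the strongest model vs. 68.9% for experts."""
    passed = sum(task_correct(predict, t["images"], t["best"], t["worst"])
                 for t in tasks)
    return passed / len(tasks)
```

Requiring consistency across permutations is what makes the metric strict: a model that flips its choice when the candidate order changes scores zero on that task even if one ordering happened to be right.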

What carries the argument

The Visual Aesthetic Benchmark (VAB), a collection of 400 comparative selection tasks over matched-subject image sets labeled by consensus of 10 expert judges each.

If this is right

  • Direct comparative selection over image sets yields higher inter-annotator agreement than scalar scoring for aesthetic preference.
  • Fine-tuning on roughly 2,000 expert-labeled examples brings a 35B-parameter model near the accuracy of a 397B-parameter open model.
  • The benchmark supplies a reusable testbed for measuring and closing the gap between models and expert aesthetic judgment.
  • Current multimodal systems remain limited for applications that require consistent curation or selection among visually similar options.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Repeated use of this benchmark could track whether future models close the observed performance difference over time.
  • The same comparative format might reveal similar gaps in other subjective visual domains such as design or product photography.
  • Models that improve on VAB could be integrated into generation pipelines to filter outputs for higher visual appeal.

Load-bearing premise

Expert consensus on comparative preference within matched-subject image sets forms a stable and representative ground truth for aesthetic quality.

What would settle it

A model that exceeds 60 percent accuracy on the same 400 tasks or an expanded version validated by fresh experts would undermine the claim of a substantial persistent gap.

Original abstract

Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set-based, expert-grounded testbed on which that gap can be tracked and closed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Visual Aesthetic Benchmark (VAB) as a comparative, set-based evaluation for multimodal large language models' aesthetic judgment capabilities. It first shows via an expert study that direct ranking tasks yield higher inter-annotator agreement than score-derived rankings, then reports that the strongest of 20 frontier MLLMs correctly identifies both the best and worst images across three random permutations in only 26.5% of 400 tasks (vs. 68.9% for human experts). Fine-tuning a 35B model on 2,000 expert examples is shown to approach the performance of a 397B open-weight model, exposing a measurable gap between current models and expert aesthetic judgment.

Significance. If the results hold, VAB supplies a useful expert-grounded testbed for tracking progress on aesthetic understanding in MLLMs, an area directly relevant to image curation, generation, and reward modeling. The controlled finding that direct comparative preference outperforms scalar scoring, together with evidence that the comparative signal is transferable via fine-tuning, provides actionable insight. The reported 26.5% vs. 68.9% gap quantifies a concrete deficit that future work can target.

major comments (3)
  1. [Abstract] The central claim of a 26.5% vs. 68.9% performance gap rests on the 10-expert consensus constituting stable ground truth, yet the abstract reports no split-half reliability, cross-panel verification, or quantitative inter-annotator agreement statistics (e.g., Fleiss' kappa or pairwise agreement rates) on the best/worst labels; without these, the gap could partly reflect annotator-pool idiosyncrasies rather than a generalizable model deficit.
  2. [Abstract] No error bars, confidence intervals, or statistical significance tests are provided for the 26.5% and 68.9% figures, nor for the claim that fine-tuning brings a 35B model close to a 397B model; these omissions make it impossible to assess whether the reported gap is robust to sampling variation across the 400 tasks.
  3. [Abstract] The construction of the 400 tasks and the criteria used to ensure matched subject matter across candidate images are not described, leaving open whether the benchmark sufficiently represents the space of visual aesthetic judgment or inadvertently favors certain image types.
minor comments (2)
  1. The abstract mentions 1,195 images across fine art, photography, and illustration but provides no breakdown of the distribution or selection process.
  2. No data-release statement or link to the VAB tasks and annotations is included, which would be needed for reproducibility and follow-up work.
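The agreement statistic the first major comment asks for is standard. A minimal sketch of Fleiss' kappa over category counts, where (illustratively) each row could record how many of the 10 judges named each candidate image the best in one task:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for N subjects x k categories, where counts[i][j]
    is the number of raters assigning subject i to category j.
    Assumes every subject was rated by the same number of raters."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    # Observed agreement: mean per-subject pairwise agreement.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects
    # Chance agreement from the marginal category proportions.
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    grand = n_subjects * n_raters
    p_e = sum((t / grand) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

Kappa of 1.0 indicates perfect agreement among judges, 0 indicates chance-level agreement; reporting it per domain (art, photography, illustration) would directly address the referee's concern about annotator-pool idiosyncrasies.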

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our benchmark's reliability and construction. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract] The central claim of a 26.5% vs. 68.9% performance gap rests on the 10-expert consensus constituting stable ground truth, yet the abstract reports no split-half reliability, cross-panel verification, or quantitative inter-annotator agreement statistics (e.g., Fleiss' kappa or pairwise agreement rates) on the best/worst labels; without these, the gap could partly reflect annotator-pool idiosyncrasies rather than a generalizable model deficit.

    Authors: We agree that the abstract should explicitly substantiate the stability of the expert consensus labels. We will revise the abstract to report quantitative inter-annotator agreement (Fleiss' kappa and pairwise rates) along with split-half reliability results for the best/worst selections. This addition will directly address concerns about potential annotator idiosyncrasies. revision: yes

  2. Referee: [Abstract] No error bars, confidence intervals, or statistical significance tests are provided for the 26.5% and 68.9% figures, nor for the claim that fine-tuning brings a 35B model close to a 397B model; these omissions make it impossible to assess whether the reported gap is robust to sampling variation across the 400 tasks.

    Authors: We acknowledge that uncertainty estimates are needed to evaluate robustness. We will add bootstrap confidence intervals for the 26.5% and 68.9% accuracies and include statistical significance tests comparing the fine-tuned 35B model to the 397B model. These will appear in the revised abstract and be detailed in the results. revision: yes

  3. Referee: [Abstract] The construction of the 400 tasks and the criteria used to ensure matched subject matter across candidate images are not described, leaving open whether the benchmark sufficiently represents the space of visual aesthetic judgment or inadvertently favors certain image types.

    Authors: We agree a concise description of task construction belongs in the abstract. We will revise the abstract to briefly state how the 400 tasks were assembled from the 1,195 images (fine art, photography, illustration) with subject-matter matching based on thematic and stylistic consistency within each set. Full curation details remain in Section 3. revision: yes
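The bootstrap confidence intervals promised in the second response can be computed directly from per-task correctness. A minimal percentile-bootstrap sketch; the 0/1 outcomes below are stand-ins for the benchmark's actual per-task results:

```python
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an accuracy estimated
    from per-task 0/1 outcomes (e.g. the 400 VAB tasks). Resamples tasks
    with replacement and takes the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With 400 tasks at 26.5% accuracy, a 95% interval on this order of width (roughly ±4 percentage points) would leave the reported model-vs-expert gap well outside sampling noise.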

Circularity Check

0 steps flagged

No circularity: benchmark built from independent expert annotations

Full rationale

The paper introduces VAB as a new benchmark whose labels are derived directly from the fresh consensus of 10 expert judges per task across 400 tasks. No equations, fitted parameters, or self-citations appear in the derivation of the core results (model accuracy of 26.5% vs. human 68.9%). The comparative preference evaluation is an empirical measurement against held-out expert labels rather than a reduction to prior inputs or self-referential definitions. The construction is independent of external benchmarks and does not invoke any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that expert consensus on comparative aesthetic preference supplies a reliable ground truth; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Expert consensus on comparative preference over matched-subject image sets constitutes a stable ground truth for aesthetic quality.
    The benchmark derives all labels from the consensus of 10 independent expert judges per task.

pith-pipeline@v0.9.0 · 5656 in / 1306 out tokens · 41331 ms · 2026-05-14T20:56:56.884329+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages
