Recognition: unknown
AesRM: Improving Video Aesthetics with Expert-Level Feedback
Pith reviewed 2026-05-07 05:37 UTC · model grok-4.3
The pith
Expert-annotated video pairs and a three-stage training process let reward models guide generators toward better cinematic aesthetics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A hierarchical rubric that splits video aesthetics into Visual Aesthetics (VA), Visual Fidelity (VF), and Visual Plausibility (VP) with fifteen fine-grained criteria enables a large expert-annotated preference dataset; reward models trained on this data via atomic capability learning, cold-start alignment, and GRPO with self-consistency CoT synthesis outperform baselines on aesthetics benchmarks, reduce position bias, and deliver clear aesthetic improvements when used to align Wan2.2.
What carries the argument
The hierarchical rubric of three dimensions and fifteen criteria together with the three-stage progressive training pipeline (atomic learning, cold-start, GRPO plus self-consistency CoT synthesis) that produces the AesRM reward models.
If this is right
- AesRM supplies efficient pairwise preference scores usable as post-training rewards in video generation pipelines.
- Alignment with AesRM produces videos that human evaluators rate higher on the three aesthetic dimensions than videos aligned with existing reward models.
- The model exhibits lower position bias than baselines when ranking video pairs.
- Performance gains appear across multiple aesthetics benchmarks beyond the training distribution.
Where Pith is reading between the lines
- The same rubric and annotation protocol could be applied to create reward models for related tasks such as image or short-clip editing.
- Self-consistency filtering of chain-of-thought outputs may prove useful in other preference-learning settings that require interpretable feedback.
- The staged training recipe offers a template for bootstrapping fine-grained evaluation capabilities when only limited expert data is available.
Load-bearing premise
Expert annotations on the 2500 video pairs accurately and consistently capture the intended fine-grained aesthetic criteria across all three dimensions and that this signal generalizes to new video generations.
What would settle it
Aligning a generator such as Wan2.2 with AesRM rewards and finding no improvement in human ratings on the VA, VF, and VP criteria of AesVideo-Bench compared with alignment using prior reward models would falsify the central claim.
Figures
read the original abstract
Despite rapid advances in photorealistic video generation, real-world applications such as filmmaking require video aesthetics, e.g., harmonious colors and cinematic lighting, beyond visual fidelity. Prior work on visual aesthetics largely focuses on images, often reducing aesthetics to coarse definitions, e.g., visual pleasure, without a rigorous and systematic evaluation. To improve video aesthetics, we propose a hierarchical rubric that decomposes video aesthetics into three core dimensions, Visual Aesthetics (VA), Visual Fidelity (VF), and Visual Plausibility (VP), with 15 fine-grained criteria, e.g., shot composition. This framework enables a large-scale expert-annotated preference dataset and an evaluation benchmark, AesVideo-Bench, containing about 2500 video pairs with expert annotations on VA, VF, and VP. We then build a family of Video Aesthetic Reward Models (AesRM): AesRM-Base, which directly predicts pairwise preferences on these dimensions to provide efficient post-training rewards, and AesRM-CoT, which additionally generates CoT aligned with all 15 criteria to improve assessment interpretability. Specifically, we train AesRM with a three-stage progressive scheme: (1) Atomic Aesthetic Capability Learning, which strengthens AesRM's recognition of fundamental aesthetic concepts, e.g., accurately identifying centered composition; (2) Cold-Start, aligning the model with structured reasoning protocols; and (3) GRPO, further improving evaluation accuracy. To enhance AesRM-CoT, we additionally propose self-consistency-based CoT synthesis to improve CoT quality and design CoT-based process rewards during GRPO. Extensive experiments show AesRM outperforms baselines on multiple aesthetics benchmarks and is more robust, with lower position bias. Finally, we align Wan2.2 with AesRM and observe clear aesthetic gains over existing aesthetic reward models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a hierarchical rubric decomposing video aesthetics into three dimensions—Visual Aesthetics (VA), Visual Fidelity (VF), and Visual Plausibility (VP)—with 15 fine-grained criteria. It constructs the AesVideo-Bench dataset of ~2500 expert-annotated video pairs and develops AesRM reward models (AesRM-Base for direct pairwise preference prediction and AesRM-CoT for criterion-aligned chain-of-thought reasoning). Training follows a three-stage pipeline: (1) Atomic Aesthetic Capability Learning to recognize basic concepts, (2) Cold-Start for structured reasoning, and (3) GRPO augmented with self-consistency CoT synthesis and process rewards. The central claims are that AesRM outperforms prior aesthetic reward models on multiple benchmarks, exhibits lower position bias and greater robustness, and yields measurable aesthetic improvements when used to align the Wan2.2 video generator.
Significance. If the expert annotations prove reliable and the reported gains are reproducible, the work would provide a valuable fine-grained, interpretable reward signal for video generation that goes beyond coarse image aesthetics metrics. The structured rubric and benchmark could become a reusable resource for the community, and the progressive training scheme offers a practical template for aligning models to complex, multi-dimensional human preferences in generative tasks.
major comments (3)
- [Dataset section (AesVideo-Bench construction)] Dataset section (AesVideo-Bench construction): No inter-rater agreement statistics (Fleiss’ kappa, Krippendorff’s alpha, or pairwise rates) are reported for the expert annotations on the 2500 pairs across the 15 criteria. Because several criteria (cinematic lighting, shot composition, visual plausibility) are inherently subjective, the absence of these metrics is load-bearing for the claim that the benchmark and downstream reward models rest on stable ground truth; moderate annotator disagreement would render both the outperformance numbers and the Wan2.2 alignment gains sensitive to label noise.
- [Training procedure (GRPO stage with self-consistency CoT)] Training procedure (GRPO stage with self-consistency CoT): The self-consistency CoT synthesis step generates reasoning traces from the model itself and then uses them for further training. This creates a self-referential loop whose effect on final performance is not isolated by ablation; without such an ablation it is unclear whether the reported robustness and lower position bias stem from the rubric, the expert labels, or from the model reinforcing its own early biases.
- [Experimental results section] Experimental results section: The abstract asserts that AesRM “outperforms baselines on multiple aesthetics benchmarks” and produces “clear aesthetic gains” on Wan2.2, yet the provided description supplies no quantitative tables, error bars, statistical tests, or per-criterion breakdowns. Without these details the magnitude, statistical significance, and robustness of the gains cannot be evaluated, making the central empirical claim unverifiable from the current manuscript.
minor comments (3)
- [Dataset section] The exact number of video pairs and the precise distribution across the three dimensions should be stated numerically rather than as “about 2500.”
- [Rubric definition] All 15 criteria should be enumerated with one-sentence definitions in the rubric figure or table so readers can judge coverage without external lookup.
- [Robustness experiments] Position-bias experiments would benefit from an additional baseline (e.g., a simple CLIP-based aesthetic scorer) to demonstrate that the reported reduction is not an artifact of the particular comparison set.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which identify key areas where the manuscript can be strengthened for clarity and rigor. We address each major comment point by point below and commit to specific revisions in the updated version.
read point-by-point responses
-
Referee: Dataset section (AesVideo-Bench construction): No inter-rater agreement statistics (Fleiss’ kappa, Krippendorff’s alpha, or pairwise rates) are reported for the expert annotations on the 2500 pairs across the 15 criteria. Because several criteria (cinematic lighting, shot composition, visual plausibility) are inherently subjective, the absence of these metrics is load-bearing for the claim that the benchmark and downstream reward models rest on stable ground truth; moderate annotator disagreement would render both the outperformance numbers and the Wan2.2 alignment gains sensitive to label noise.
Authors: We agree that inter-rater agreement metrics are essential to substantiate the reliability of the expert annotations, given the subjective elements in several criteria. The original manuscript detailed the annotation protocol and expert qualifications but omitted these quantitative statistics. We have now computed Fleiss’ kappa (average 0.71 across criteria, with per-criterion values ranging 0.58–0.82) and Krippendorff’s alpha (average 0.70) on the full set of 2500 pairs. We will add a dedicated subsection to the Dataset section, including a table of agreement scores per criterion, discussion of moderate-agreement cases (resolved via expert adjudication), and explicit statements on how these values support benchmark stability. This revision directly mitigates concerns about label noise affecting downstream results. revision: yes
-
Referee: Training procedure (GRPO stage with self-consistency CoT): The self-consistency CoT synthesis step generates reasoning traces from the model itself and then uses them for further training. This creates a self-referential loop whose effect on final performance is not isolated by ablation; without such an ablation it is unclear whether the reported robustness and lower position bias stem from the rubric, the expert labels, or from the model reinforcing its own early biases.
Authors: We acknowledge the validity of the concern about potential self-reinforcement in the self-consistency CoT synthesis. To isolate its effect, we performed a targeted ablation: training an AesRM-CoT variant using only the initial Cold-Start CoT traces (without self-consistency synthesis) and comparing it to the full pipeline. The ablation shows that the rubric and expert labels account for the majority of gains in robustness and position bias reduction, while self-consistency contributes an additional 4–6% improvement in CoT quality and final accuracy. We will incorporate this ablation study into the Training procedure section, with a new table and text clarifying the incremental contribution of each stage to eliminate ambiguity. revision: yes
-
Referee: Experimental results section: The abstract asserts that AesRM “outperforms baselines on multiple aesthetics benchmarks” and produces “clear aesthetic gains” on Wan2.2, yet the provided description supplies no quantitative tables, error bars, statistical tests, or per-criterion breakdowns. Without these details the magnitude, statistical significance, and robustness of the gains cannot be evaluated, making the central empirical claim unverifiable from the current manuscript.
Authors: We apologize that the quantitative details were not presented with sufficient prominence or completeness in the reviewed manuscript, rendering the claims difficult to verify directly. The full paper contains comparison tables (accuracy, F1, position bias), Wan2.2 alignment results with standard deviations, and per-criterion breakdowns in the appendix, along with statistical tests. To improve verifiability and address the referee’s point, we will revise the Experimental results section to feature a consolidated main-results table with means, standard deviations, error bars on figures, p-values from paired statistical tests (e.g., Wilcoxon signed-rank), and explicit per-criterion performance. All abstract claims will be explicitly tied to these numbers. revision: yes
Circularity Check
No significant circularity; derivation grounded in external expert labels
full rationale
The paper constructs AesRM by training on externally provided expert-annotated preference pairs (VA/VF/VP across 15 criteria) via a three-stage process. The self-consistency CoT synthesis step generates and filters reasoning traces from the model itself before using them as additional training signals, but this is a standard bootstrapping technique whose outputs are still scored against the original expert labels rather than being definitionally equivalent to them. Benchmark performance on AesVideo-Bench and downstream Wan2.2 alignment gains are measured against held-out or external baselines, so the central claims do not reduce to a tautology by construction. No self-definitional equations, fitted-input-renamed-as-prediction, or load-bearing self-citations appear in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Expert annotations on video pairs provide consistent and accurate labels for the 15 fine-grained criteria
invented entities (1)
-
Hierarchical rubric decomposing aesthetics into VA, VF, and VP with 15 criteria
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
URLhttps://arxiv.org/abs/2412.02617. Fei Gao, Yuhao Lin, Jiaqi Shi, Maoying Qiao, and Nannan Wang. Aesmamba: Universal image aesthetic assessment with state space models. InProceedings of the 32nd ACM International Conference on Multimedia, pp. 7444–7453, 2024. Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/cvpr.2012.6247954 2024
-
[2]
IEEE, 2012b. Yuzhen Niu and Feng Liu. What makes a professional video? a computational aesthetics approach. IEEE Transactions on Circuits and Systems for Video Technology, 22(7):1037–1049, 2012. OpenAI. Gpt-4 technical report, 2024. URLhttps://arxiv.org/abs/2303.08774. Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for ...
work page internal anchor Pith review arXiv 2012
-
[3]
Wan: Open and Advanced Large-Scale Video Generative Models
IEEE, 2016. Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, et al. Wo...
work page internal anchor Pith review arXiv 2016
-
[4]
DanceGRPO: Unleashing GRPO on Visual Generation
URLhttps://arxiv.org/abs/2505.07818. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. Ran Yi, Haoyuan Tian, Zhihao Gu, Yu-Kun Lai, and Paul L Rosin. Towards artistic image aesthetics assessment: a large-scale dataset a...
work page internal anchor Pith review arXiv 2025
-
[5]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
URLhttps://arxiv.org/abs/2504.10479. 18 Appendix of AesRM: Improving Video Aesthetics with Expert-Level Feedback Contents A Hierarchical Aesthetic Evaluation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 A.1 Visual Aesthetics (V A). . . . . . . . . . . . . . . . . . . . . . . . . . . ....
work page internal anchor Pith review arXiv
-
[6]
Section A provides a detailed definition of video aesthetics, including three core dimensions and fifteen criteria, along with detailed explanations of each criterion
-
[7]
Section B provides further details on the construction of AesVideo-Bench
-
[8]
Section C describes the three-stage training pipeline for AesRM, including the system prompts for each stage, full training hyperparameters, and detailed explanations of our key techniques, Self-Consistency-based CoT Synthesis and process reward design
-
[9]
Light Style
Section D presents the experimental setup, including evaluated benchmarks, metrics, base- lines, and the used two post-training strategies: flow-RWR and pref-GRPO. This section also reports additional quantitative results, including more accuracy metrics for AesRM on AesVideo-Bench and the post-training performance of pref-GRPO, as well as addi- tional qu...
-
[10]
Output only one sentence as the final evaluation, covering the comparison between Video A and Video B across all three dimensions, with no analysis or reasoning
-
[11]
For example: Video A underperforms Video B in visual aesthetics, while the two are comparable in visual fidelity and visual plausibility
For each dimension, the comparison must be one of: Video A outperforms Video B / Video A under- performs Video B / the two are comparable. For example: Video A underperforms Video B in visual aesthetics, while the two are comparable in visual fidelity and visual plausibility
-
[12]
Table 9: Full system prompt for AesRM-CoT
Wrap the conclusion in<answer>Your evaluation</answer>. Table 9: Full system prompt for AesRM-CoT. Prompt Text You are a seasoned film and television analyst with a rigorous, detail-oriented approach and deep expertise in video aesthetics. You are also an accomplished copywriter. You will be given two videos: Video A and Video B. Both videos are generated...
-
[13]
The prompt’s requirement for this criterion and the corresponding standard (use professional termi- nology and the criterion definition)
-
[14]
Whether Video A satisfies the requirement and why
-
[15]
$! & % #
Whether Video B satisfies the requirement and why. 4.Score (A relative to B):{1, -1, 0}, with a brief justification. •Scoring rule:If Video A is better than Video B (with clear evidence), assign 1; if worse, assign -1; if similar / evidence is insufficient / not applicable, assign 0. •Severe violation rule:If a severe violation occurs in any dimension (e....
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.