pith. machine review for the scientific record.

arxiv: 2605.03475 · v2 · submitted 2026-05-05 · 💻 cs.CV

Recognition: 3 theorem links


WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords: generative video models · video quality benchmark · VLM as judge · human preference study · Bradley-Terry rating · Likert scale evaluation · multi-dimensional assessment · video generation metrics

The pith

A VLM using Likert-scale questionnaires on native-resolution frames reproduces human three-tier video quality rankings with perfect correlation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes WorldJen as an end-to-end benchmark that replaces pixel-based metrics and binary VQA with multi-dimensional Likert evaluations graded by a vision-language model. It first gathers human pairwise preferences on 50 adversarially curated prompts across six video models to produce Bradley-Terry ratings that form three clear tiers. The VLM judge, answering ten dimension-specific questions per prompt on full-resolution frames, independently recovers the identical tier ordering. This matters because current metrics either ignore semantic and physical correctness or require separate low-resolution audits for each quality aspect, driving up evaluation cost while missing temporal failures.
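As a concrete picture of the first stage, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise votes with the MM updates of Hunter (2004) cited by the paper, then converting them to the 1500 + 400·log10 rating scale quoted in the appendix excerpts later on this page. The vote counts are illustrative, not the paper's data, and this is not the paper's own fitting code.

```python
import math
from collections import defaultdict

def fit_bradley_terry(pairwise_wins, n_iter=200):
    """Fit Bradley-Terry strengths p_i from {(winner, loser): count}
    using Hunter's MM updates; strengths are normalised to mean 1."""
    models = sorted({m for pair in pairwise_wins for m in pair})
    p = {m: 1.0 for m in models}
    wins = defaultdict(float)      # total wins per model
    games = defaultdict(float)     # total comparisons per unordered pair
    for (w, l), c in pairwise_wins.items():
        wins[w] += c
        games[frozenset((w, l))] += c
    for _ in range(n_iter):
        new_p = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in models if j != i and frozenset((i, j)) in games
            )
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        mean = sum(new_p.values()) / len(new_p)
        p = {m: v / mean for m, v in new_p.items()}
    return p

def bt_rating(strengths):
    """Rating_i = 1500 + 400 * log10(p_i / geometric mean of all p),
    as in the appendix passage quoted further down this page."""
    geo = math.exp(sum(math.log(v) for v in strengths.values()) / len(strengths))
    return {m: 1500 + 400 * math.log10(v / geo) for m, v in strengths.items()}

# Toy example: 3 models, illustrative win counts (not the paper's data).
votes = {("A", "B"): 30, ("B", "A"): 20, ("A", "C"): 40,
         ("C", "A"): 10, ("B", "C"): 35, ("C", "B"): 15}
print(bt_rating(fit_bradley_terry(votes)))
```

Any positive rescaling of the strengths leaves the ordering, and hence the tiers, unchanged; only the geometric-mean anchor fixes the 1500 centre.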

Core claim

WorldJen shows that a VLM-as-a-judge system, supplied with prompt-specific Likert questionnaires covering up to sixteen quality dimensions simultaneously, matches the three-tier structure of human Bradley-Terry ratings derived from 2,696 pairwise annotations, achieving Spearman correlation of 1.000.
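The reported significance is consistent with an exact rank test on six models: under a null of random ordering, a perfect Spearman correlation occurs for exactly 1 of 6! = 720 permutations, i.e. p ≈ 0.0014. A small self-contained check, with illustrative ratings standing in for the paper's values:

```python
from itertools import permutations

def spearman_rho(x, y):
    """Spearman rank correlation for untied data (Pearson on ranks)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def exact_one_sided_p(x, y):
    """P(rho_perm >= rho_obs) over all n! reorderings of y."""
    rho_obs = spearman_rho(x, y)
    total = hits = 0
    for perm in permutations(y):
        total += 1
        if spearman_rho(x, list(perm)) >= rho_obs - 1e-12:
            hits += 1
    return rho_obs, hits / total

# Illustrative scores for six models (not the paper's values).
human_bt = [1710, 1650, 1540, 1480, 1390, 1230]
vlm_score = [4.4, 4.1, 3.6, 3.3, 2.9, 2.2]
print(exact_one_sided_p(human_bt, vlm_score))  # rho = 1.0, p = 1/720 ≈ 0.00139
```

With n = 6 the smallest attainable one-sided p-value is 1/720, so the reported 0.0014 is the floor of this test.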

What carries the argument

The VLM-as-a-judge engine that scores videos via dimension-specific Likert questionnaires (ten questions each) at native resolution, validated against human-derived Bradley-Terry ratings.
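A rough sketch of what such a scoring loop could look like, assuming a judge callable that returns one 1-to-5 Likert answer per question. The `ask_vlm` callable, the dimension names, and the unweighted aggregation are placeholders for illustration, not the paper's implementation (which calibrates per-dimension weights).

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

def score_video(
    frames,                                  # native-resolution frames for one video
    questionnaires: Dict[str, List[str]],    # dimension -> its 10 Likert questions
    ask_vlm: Callable[[object, str], int],   # placeholder judge: (frames, question) -> 1..5
) -> Dict[str, Optional[float]]:
    """Average the 1-5 Likert answers per dimension; None marks a dimension
    the questionnaire deems non-applicable for this prompt."""
    scores: Dict[str, Optional[float]] = {}
    for dim, questions in questionnaires.items():
        if not questions:                    # null / non-applicable dimension
            scores[dim] = None
            continue
        answers = [ask_vlm(frames, q) for q in questions]
        scores[dim] = mean(answers)
    return scores

def model_score(per_video_scores: List[Dict[str, Optional[float]]]) -> float:
    """Collapse per-dimension scores into one number per model: average over
    applicable dimensions, then over videos (an unweighted stand-in for the
    paper's calibrated aggregation)."""
    per_video = [
        mean(v for v in s.values() if v is not None) for s in per_video_scores
    ]
    return mean(per_video)

# Example with a dummy judge that always answers 4:
demo = score_video(frames=None,
                   questionnaires={"motion_smoothness": ["q"] * 10,
                                   "human_fidelity": []},
                   ask_vlm=lambda f, q: 4)
print(demo)   # motion_smoothness -> 4, human_fidelity -> None (not applicable)
```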

If this is right

  • Generative video models can be ranked on multiple quality dimensions at once without generating separate videos for each dimension.
  • VLM judges become usable as scalable stand-ins for human raters once the three-tier agreement is confirmed.
  • Evaluation no longer depends on low-resolution binary auditors that overlook temporal inconsistencies.
  • The six ablation studies demonstrate that the Likert format and native-resolution input are both required for the observed agreement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could support continuous automated leaderboards that update as new video models appear.
  • Extending the same prompt curation and questionnaire design to longer videos or 3D content would test whether the tier agreement holds beyond the current 50-prompt scope.

Load-bearing premise

The human preference study with 66.9 percent inter-annotator agreement on fifty prompts supplies a stable ground-truth three-tier ranking.
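A minimal sketch of how such an agreement figure can be computed, assuming (as the appendix excerpts below suggest) that agreement is taken over prompt-model pairs judged by at least two annotators; the vote layout and the tiny example are illustrative, not the paper's data.

```python
from collections import defaultdict
from itertools import combinations

def mean_pairwise_agreement(votes):
    """votes: list of (annotator, prompt_id, model_a, model_b, winner).
    For every (prompt, unordered model pair) seen by >= 2 annotators,
    compute the fraction of agreeing annotator pairs, then average."""
    by_item = defaultdict(dict)   # (prompt, frozenset(models)) -> {annotator: winner}
    for annotator, prompt, m_a, m_b, winner in votes:
        by_item[(prompt, frozenset((m_a, m_b)))][annotator] = winner
    rates = []
    for choices in by_item.values():
        if len(choices) < 2:
            continue
        pairs = list(combinations(choices.values(), 2))
        rates.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(rates) / len(rates) if rates else float("nan")

# Tiny illustrative example (three annotators, one prompt and model pair):
votes = [("A1", "p1", "M1", "M2", "M1"),
         ("A2", "p1", "M1", "M2", "M1"),
         ("A3", "p1", "M1", "M2", "M2")]
print(mean_pairwise_agreement(votes))   # 0.333...: one of three annotator pairs agrees
```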

What would settle it

Running the same VLM judge on a fresh set of prompts or models yields Spearman correlation below 0.9 with new human Bradley-Terry ratings.

read the original abstract

Evaluating generative video models remains an open problem. Reference-based metrics such as Structural Similarity Index Measure (SSIM) and Peak Signal to Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Frechet Video Distance (FVD) favors distributional textures over physical plausibility. Binary Visual Question Answering (VQA) based benchmarks like VBench~2.0 are prone to yes-bias and rely on low-resolution auditors that miss temporal failures. Moreover, their prompts target a single dimension at a time, multiplying the number of videos required while still not guaranteeing reliable results. WorldJen addresses these limitations directly. Binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are addressed by using adversarially curated prompts that are designed to exercise up to 16 quality dimensions simultaneously. The framework is built around two interlocking contributions. First, a blind human preference study is conducted, accumulating 2,696 pairwise annotations from 7 annotators with 100% pair coverage over 50 of the curated prompts $\times$ 6 state-of-the-art video models. A mean inter-annotator agreement of 66.9% is achieved and the study establishes a human ground-truth Bradley-Terry (BT) rating with a three-tier structure. Second, a VLM-as-a-judge evaluation engine using prompt-specific, dimension-specific Likert questionnaires (10 questions per dimension, 47,160 scored responses) judges the videos and reproduces the human-established three-tier BT rating structure independently. The VLM achieves a Spearman $\hat{\rho}=1.000,~p=0.0014$ that is interpreted as tier agreement with the human results. Six focused ablation studies validate the robustness of the VLM evaluation framework. Project page: https://moonmath.ai/worldjen/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces WorldJen, a benchmark for generative video models that replaces binary VQA and reference-based metrics with a VLM-as-judge using prompt-specific Likert-scale questionnaires across 16 dimensions on adversarially curated prompts. It reports a human preference study with 2,696 pairwise annotations over 50 prompts and 6 models (66.9% mean inter-annotator agreement) that yields a Bradley-Terry model with a three-tier structure, and claims the VLM reproduces this exact tier structure with Spearman ρ̂=1.000 (p=0.0014) based on 47,160 scored responses, validated by six ablation studies.

Significance. If the validation holds, the work would be significant for the field by providing a scalable, multi-dimensional evaluation framework that aligns with human judgments on semantic and temporal aspects at native resolution, addressing documented weaknesses in FVD, SSIM, and binary VQA benchmarks. The independent human study and emphasis on simultaneous dimension coverage are clear strengths; however, the small scale of the human data limits the strength of claims about the reliability of the 16 individual dimensions.

major comments (3)
  1. [validation section] Human preference study (described in the validation section): The ground-truth three-tier BT structure is derived from only 6 models and 50 prompts with 66.9% inter-annotator agreement, producing a low-resolution, noisy signal. The reported perfect Spearman ρ̂=1.000 (p=0.0014) on this coarsened aggregate ranking does not establish that the VLM's 16 independent Likert scales reliably measure the intended dimensions rather than merely recovering the coarse ordering.
  2. [results section] VLM evaluation engine (results section): With only 6 models, the tier agreement test has very low statistical power; the p=0.0014 does not rule out that the VLM is capturing broad quality signals rather than the 16 distinct dimensions, especially given the absence of per-dimension human correlations or inter-dimension consistency checks.
  3. [section 6] Ablation studies (section 6): The six ablations are cited as validating robustness, but without quantitative details on how they isolate VLM prompt biases, Likert scale calibration, or sensitivity to the adversarial prompt curation, it is unclear whether they address the core concern of circularity between VLM judgments and the human tiers.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief statement on the exact VLM model, temperature settings, and prompt template used for the Likert questionnaires to improve reproducibility.
  2. [validation section] Notation for the Bradley-Terry model and Spearman correlation should include the explicit formula or reference to avoid ambiguity in how ties in the three-tier structure are handled.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments, which highlight important considerations regarding scale and validation strength. We address each major comment point by point below, clarifying our methodology and indicating revisions where they strengthen the manuscript without altering core claims.

read point-by-point responses
  1. Referee: [validation section] Human preference study (described in the validation section): The ground-truth three-tier BT structure is derived from only 6 models and 50 prompts with 66.9% inter-annotator agreement, producing a low-resolution, noisy signal. The reported perfect Spearman ρ̂=1.000 (p=0.0014) on this coarsened aggregate ranking does not establish that the VLM's 16 independent Likert scales reliably measure the intended dimensions rather than merely recovering the coarse ordering.

    Authors: We agree the human study scale (50 prompts, 6 models, 66.9% agreement) yields a coarse three-tier BT structure, which is a deliberate design choice to achieve full pairwise coverage with 2,696 annotations. The perfect Spearman correlation validates alignment at the aggregate tier level, which is the primary claim. However, we acknowledge this does not directly prove independence of the 16 Likert scales, as human data consists of overall preferences rather than dimension-specific ratings. We will revise the validation section to explicitly state this scope and add a limitations paragraph discussing the aggregate nature of the human ground truth. revision: partial

  2. Referee: [results section] VLM evaluation engine (results section): With only 6 models, the tier agreement test has very low statistical power; the p=0.0014 does not rule out that the VLM is capturing broad quality signals rather than the 16 distinct dimensions, especially given the absence of per-dimension human correlations or inter-dimension consistency checks.

    Authors: The small number of models limits statistical power for the tier test, and we will note this explicitly. The p=0.0014 reflects the exact ordering match on the coarsened tiers. To support distinct dimensions, the manuscript includes inter-dimension correlation analysis in the ablations (showing average pairwise correlations below 0.3 across dimensions), indicating they capture non-redundant signals. We cannot provide per-dimension human correlations, as the preference study collected holistic pairwise judgments. We will add this clarification and the inter-dimension results to the results section. revision: partial

  3. Referee: [section 6] Ablation studies (section 6): The six ablations are cited as validating robustness, but without quantitative details on how they isolate VLM prompt biases, Likert scale calibration, or sensitivity to the adversarial prompt curation, it is unclear whether they address the core concern of circularity between VLM judgments and the human tiers.

    Authors: We will expand Section 6 with quantitative ablation results, including: (1) correlation shifts when using non-adversarial prompts to isolate curation effects; (2) Likert scale sensitivity tests via rescaling experiments; and (3) bias checks via prompt perturbation. These demonstrate that VLM scores remain stable and independent of the human tier derivation process, addressing circularity concerns. The ablations were designed to test robustness without relying on the human data. revision: yes

standing simulated objections not resolved
  • The human preference study collected only overall pairwise preferences and does not include per-dimension ratings, so direct validation of each of the 16 VLM Likert scales against human judgments on individual dimensions cannot be performed without a new, larger study.

Circularity Check

0 steps flagged

No significant circularity; VLM validation uses independent human Bradley-Terry ground truth

full rationale

The paper's central derivation consists of two independent stages: (1) a human preference study collecting 2,696 pairwise annotations over 50 prompts and 6 models to fit a Bradley-Terry model and derive a three-tier structure, and (2) a separate VLM judge producing 47,160 Likert-scale responses that is then compared to the human tiers via Spearman correlation. The VLM outputs are not fitted to the human data, nor are any parameters or definitions circularly interdependent. No self-citations appear as load-bearing premises, no uniqueness theorems are imported from prior author work, and no ansatz or renaming reduces the claimed result to its inputs by construction. The human study functions as an external benchmark rather than a self-referential fit.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No free parameters or invented entities; relies on domain assumptions about VLM judgment capabilities and standard statistical models for preference aggregation.

axioms (2)
  • domain assumption A VLM can accurately grade video frames on Likert scales for multiple quality dimensions at native resolution
    This underpins the VLM-as-a-judge engine replacing binary VQA.
  • standard math The Bradley-Terry model applied to pairwise human annotations yields a reliable three-tier ranking of video models
    Used to establish human ground-truth.

pith-pipeline@v0.9.0 · 5648 in / 1184 out tokens · 88848 ms · 2026-05-08T19:02:31.142686+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation (no overlap) washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    Strengths are estimated via the Minorization-Maximization (MM) algorithm... BT rating_i = 1500 + 400 log_10(p_i / p̄_geom)

  • Cost (J-cost) Jcost_unit0 unclear

    Relation between the paper passage and the cited Recognition theorem.

    PHAS(m) = (1/|P|) Σ_p [ (Σ_d w_d s_{m,p,d}) / (Σ_d w_d) ] · λ(m,p), where w_d are calibrated by non-negative ridge logistic regression on human preference annotations.
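The quoted PHAS formula is a suitability-weighted average of per-dimension Likert scores, averaged over prompts. A compact sketch under that reading, with made-up weights and scores in place of the paper's calibrated values (the λ(m,p) factor is taken as given per video, and non-applicable dimensions are simply absent, mirroring the "excluded, not zeroed" rule quoted below):

```python
def phas(model, prompts, scores, weights, lam):
    """PHAS(m) = (1/|P|) * sum_p [ sum_d w_d * s_{m,p,d} / sum_d w_d ] * lambda(m,p).
    scores[(m, p)] maps each applicable dimension to its Likert score."""
    total = 0.0
    for p in prompts:
        dims = scores[(model, p)]
        w_sum = sum(weights[d] for d in dims)
        weighted = sum(weights[d] * s for d, s in dims.items())
        total += (weighted / w_sum) * lam[(model, p)]
    return total / len(prompts)

# Illustrative call (two prompts, three dimensions, made-up numbers):
weights = {"semantic_adherence": 0.6, "motion_smoothness": 0.3, "human_fidelity": 0.4}
scores = {("M1", "p1"): {"semantic_adherence": 4.2, "motion_smoothness": 3.8},
          ("M1", "p2"): {"semantic_adherence": 3.5, "human_fidelity": 4.0}}
lam = {("M1", "p1"): 1.0, ("M1", "p2"): 0.9}
print(phas("M1", ["p1", "p2"], scores, weights, lam))
```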

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

98 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    VideoPhy: Evaluating physical commonsense for video generation

    Hritik Bansal, Zongyu Lee, Xinkai Ma, Vikram Li, Aditya Grover, Kai-Wei Chang, and Nanyun Peng. VideoPhy: Evaluating physical commonsense for video generation. In International Conference on Learning Representations (ICLR), 2025

  2. [2]

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952

  3. [3]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021

  4. [4]

    Chatbot arena: An open platform for evaluating LLMs by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhuang, Zhanghao Wu, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Chatbot arena: An open platform for evaluating LLMs by human preference. In ICML, 2024

  5. [5]

    fal.ai: Fast inference for generative AI. https://fal.ai, 2024

    fal.ai. fal.ai: Fast inference for generative AI. https://fal.ai, 2024

  6. [6]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  7. [7]

    Problems of monetary management: The UK experience

    Charles A.E. Goodhart. Problems of monetary management: The UK experience. Papers in Monetary Economics, 1975

  8. [8]

    Veo 3: State-of-the-art video generation

    Google DeepMind. Veo 3: State-of-the-art video generation. https://deepmind.google/technologies/veo/, 2025

  9. [9]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Amit Brazowski, Neta Shaul, Omer Berman, Daniel Peleg, Idan Leshem, Uriel Singer, Dana Tamir, David Grabli, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2025

  10. [10]

    Tag2Text: Guiding vision-language model via image tagging

    Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2Text: Guiding vision-language model via image tagging. In ICLR, 2024

  11. [11]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognit...

  12. [12]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. arXiv preprint arXiv:2411.13503, 2024

  13. [13]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Ziqi Huang, Fan Zhang, Xiaojie Luo, Chenyang Si, Yinan He, et al. VBench-2.0: Advancing video generation benchmark with intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025

  14. [14]

    MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406, 2004

    David R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406, 2004

  15. [15]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  16. [16]

    A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2):155–163, 2016

    Terry K. Koo and Mae Y. Li. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2):155–163, 2016

  17. [17]

    Content Analysis: An Introduction to Its Methodology

    Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage Publications, 4th edition, 2018

  18. [18]

    Kling: A generative video foundation model

    Kuaishou Technology. Kling: A generative video foundation model. https://kling.kuaishou.com, 2024

  19. [19]

    The measurement of observer agreement for categorical data

    J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977

  20. [20]

    AlpacaEval: An automatic evaluator of instruction-following models

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023

  21. [21]

    WildBench: Benchmarking LLMs with challenging tasks from real users in the wild

    Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Bhatt, Abhilasha Ravichander, et al. WildBench: Benchmarking LLMs with challenging tasks from real users in the wild. In ICLR, 2024

  22. [22]

    EvalCrafter: Benchmarking and evaluating large video generation models

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. EvalCrafter: Benchmarking and evaluating large video generation models. In CVPR, 2024

  23. [23]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

  24. [24]

    VQQA: An agentic approach for video evaluation and quality improvement. arXiv preprint arXiv:2603.12310, 2026

    Yiwen Song, Tomas Pfister, and Yale Song. VQQA: An agentic approach for video evaluation and quality improvement. arXiv preprint arXiv:2603.12310, 2026

  25. [25]

    T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation

    Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  26. [26]

    RAFT: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020

  27. [27]

    Towards accurate generative models of video: A new metric & challenges

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2019

  28. [28]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan Team. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  29. [29]

    A very big video reasoning suite

    Maijunxian Wang, Zhongang Cai, et al. A Very Big Video Reasoning Suite. arXiv preprint arXiv:2602.20159, 2026. URL https://arxiv.org/abs/2602.20159

  30. [30]

    VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models

    Wenhao Wang and Yi Yang. VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  31. [31]

    Image quality assessment: From error visibility to structural similarity

    Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004

  32. [32]

    Grit: A generative region-to-text transformer for object understanding

    Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. GRiT: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022

  33. [33]

    Jerrold H. Zar. Significance testing of the Spearman rank correlation coefficient. Journal of the American Statistical Association, 67(339):578–580, 1972

  34. [34]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

  35. [35]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023

  36. [36]

    All 300 video files are stored on Google Drive; the Apps Script backend maps each (prompt_id, model) tuple to a Drive file ID

    Confirm assets. VLM evaluation is complete for all 50 prompts × 6 models = 300 videos. All 300 video files are stored on Google Drive; the Apps Script backend maps each (prompt_id, model) tuple to a Drive file ID

  37. [37]

    Left/right assignment is randomised independently per pair per session

    Pair generation. For each prompt, the interface automatically enumerates all (6 choose 2) = 15 model pairings, yielding 750 total pairs. Left/right assignment is randomised independently per pair per session. Already-completed pairs (stored in the annotator’s Google Sheet) are filtered out on resume so no pair is shown twice to the same annotator

  38. [38]

    All Done

    Session design. Each annotator’s queue contains all 750 pairs they have not yet personally judged, sorted by ascending global coverage (least-reviewed pairs first). A break overlay appears every 50 pairs. An annotator who completes all their remaining pairs sees an “All Done” screen

  39. [39]

    Video A” / “Video B

    Access control. Annotators identify themselves by entering their email address at session start. The interface normalises emails to lowercase for consistent history lookup. Drive folder access is granted at the folder level so the script can serve video blobs. Note: the publicly released dataset uses anonymized annotator IDs (A1–A7) in place of email addres...

  40. [40]

    Read the prompt first. Base your decision on prompt-faithfulness, not visual polish

  41. [41]

    Prioritise core action over background. Rank requirements mentally: Core Action / Physics → Characters → Background

  42. [42]

    Symmetric artifacts cancel. If both videos flicker or both clip, ignore that and judge what differs

  43. [43]

    barely better

    Forced choice — no skips. Choose the more faithful attempt even if neither is perfect. Use Slightly better when the margin is thin. Decision Tree: Can you tell the videos apart on prompt faithfulness? | ⊢ One succeeds where the other clearly fails → Much better (e.g. core action done vs. not done; physics violated vs. correct) | ⊢ Noticeable gap, but both have s...

  44. [44]

    Annotators click Continue when ready

    Session breaks. After every 50 pairs a break overlay appears with the pair count. Annotators click Continue when ready. Progress is saved continuously to Google Sheets, so closing the browser is safe

  45. [45]

    Loading... (slow connection)

    Slow connections. If a video does not buffer within 5 seconds, a “Loading... (slow connection)” hint is shown. After 12 seconds the client automatically re-fetches a fresh copy of the video from Drive. Step 3 — Data Export and Aggregation

  46. [46]

    In the released anonymized dataset, timestamp is dropped and email is replaced by an opaque annotator ID (A1–A7)

    Google Sheets format. Each vote is appended as one row with columns: timestamp, email, prompt_id, model_a, model_b, winner, loser, confidence, source. In the released anonymized dataset, timestamp is dropped and email is replaced by an opaque annotator ID (A1–A7)

  47. [47]

    Inter-annotator agreement. For pairs judged by ≥ 2 annotators, compute mean IAA and Krippendorff’s α

  48. [48]

    Report 95% bootstrap CIs (1,000 resamples)

    Human BT rating. Pool all comparisons and fit an unweighted Bradley-Terry model (each vote contributes one win/loss; confidence labels are used only in the PHAS step below) to obtain per-model Human BT rating anchored at 1500. Report 95% bootstrap CIs (1,000 resamples)

  49. [49]

    • Label y ∈ {0, 1}: y = 1 if the annotator preferred m_A, y = 0 otherwise; sample weight = confidence (Much/Clearly/Slightly → 3/2/1)

    Calibrated PHAS weights. Using the 30-prompt calibration split (1,653 annotations; disjoint from the 20-prompt validation set), fit a non-negative constrained ridge logistic regression: • Feature vector x ∈ R^16: per-dimension VLM score difference x_d = s_{m_A,p,d} − s_{m_B,p,d} for each applicable dimension; null-suitability dimensions are excluded (not zeroed) per ...

  50. [50]

    Human BT rating)

    Validation. Evaluate the calibrated weights on the held-out 20-prompt validation set (1,043 annotations): report pairwise prediction accuracy and PHAS model ranking (Spearman ρ̂ vs. Human BT rating). B. Prompt Curation B.1. Definitions Table 20 provides full definitions for all 16 evaluation dimensions. B.2. VidProM Filtering Pipeline The filtering pipeli...

  51. [51]

    NSFW/safety filter: VidProM’s built-in classifier removes sexually explicit, violent, and hateful content

  52. [52]

    3. Length filter: Prompts <30 characters or >500 characters are removed

    Deduplication: Exact-hash deduplication followed by MinHash/LSH near-duplicate removal with a Jaccard threshold of 0.8. 3. Length filter: Prompts <30 characters or >500 characters are removed

  53. [53]

    Prompts in the bottom quartile are discarded

    Complexity score: An LLM-estimated score rewards prompts involving physics interactions, multi-subject scenes, temporal events, and spatial relationships. Prompts in the bottom quartile are discarded

  54. [54]

    Blacklist: Prompts containing URLs, political figures, named celebrities, or trademarked properties are flagged and removed

  55. [55]

    These stages retain approximately 5,000 prompts (∼0.3% of the original corpus)

    Spam detection: Repetitive, malformed, or auto-generated prompts are removed via an n-gram-based classifier. These stages retain approximately 5,000 prompts (∼0.3% of the original corpus). Subsequent LLM judging further flags 276 (7.4%) for copyright/safety review, yielding the final set of 3,754 unique prompts. Table 20 | Complete dimension taxonomy used...

  56. [56]

    subject_consistency: Does the main character/object change shape, color, or identity during the video? - suitability: Does this prompt create conditions where subject inconsistency would be exposed? - difficulty: How hard for a video model to keep identity consistent?

  57. [57]

    scene_consistency: Does the environment stay stable or warp/melt? - suitability: Does this prompt expose scene warping during camera motion? - difficulty: How hard to keep scene stable during camera motion?

  58. [58]

    motion_smoothness: Does the video have stuttering or jitter? - suitability: Does the prompt expose frame skips? (fast/complex motion scores high) - difficulty: How hard for a model to render this motion smoothly? - NOTE: Rendering quality only — not physics

  59. [59]

    Intentional lighting changes are NOT flickering

    temporal_flickering: Are there flashes or brightness artifacts? - suitability: Score high only for complex textures (water, hair, fire, smoke, fine patterns). Intentional lighting changes are NOT flickering. - difficulty: How hard to avoid unwanted flickering?

  60. [60]

    - difficulty: How hard to render physically accurate inertia? - NOTE: Physics (velocity changes) only — not rendering smoothness

    inertial_consistency: Do objects follow laws of momentum? - suitability: Focus on velocity changes (falling, stopping, throwing, catching, sliding to a stop). - difficulty: How hard to render physically accurate inertia? - NOTE: Physics (velocity changes) only — not rendering smoothness. **Group B: Logic & Physics** Applicable if: prompt involves physical...

  61. [61]

    physical_mechanics: Do gravity, friction, and collisions look realistic?

  62. [62]

    object_permanence: If an object goes behind a wall, does it look the same when it reappears?

  63. [63]

    human_fidelity: Are humans rendered without alien artifacts (extra fingers, distorted faces, impossible body twisting)? Set to null if no humans in the prompt

  64. [64]

    dynamic_degree: Is there actual movement, or just a still image with zoom? **Group C: Instruction Adherence** Applicable if: prompt has specific objects, colors, spatial relationships, or precise requirements

  65. [65]

    semantic_adherence: Does the video contain exactly what was asked?

  66. [66]

    spatial_relationship: Are objects in the right relative positions?

  67. [67]

    semantic_drift: Does the AI start following the prompt but "forget" it halfway through? **Group D: Aesthetic Quality** Applicable if: prompt involves specific artistic styles, high-detail environments, or cinematic descriptions

  68. [68]

    composition_framing: Is the shot well-balanced?

  69. [69]

    lighting_volumetric: Is the lighting realistic with depth?

  70. [70]

    color_harmony: Are the colors pleasing and consistent?

  71. [71]

    - Difficulty: 1 = easy for model, 5 = moderate, 10 = extremely hard

    structural_gestalt: Do elements look like they belong in the same world? Scoring guidelines: - Suitability: 1 = poor test, 5 = decent test, 10 = excellent/ideal. - Difficulty: 1 = easy for model, 5 = moderate, 10 = extremely hard. - Set scores to null for non-applicable dimensions. - Flag needs_review = true for harmful, policy-violating, or copyright-sen...

  72. [72]

    Fixing language: Correct grammar, spelling, improve coherence

  73. [73]

    Addressing weak dimensions: Add specific elements to boost weak dimensions (listed in the user message)

  74. [74]

    individual water droplets

    Preserving core theme: Keep the main subject and concept EXACTLY as intended. Guidelines for weak dimensions (use specific, stress-testing details): - motion_smoothness: Add fast/complex motion (running, spinning, fast-moving objects). - temporal_flickering: Add complex textures (water, fire, hair, reflective surfaces). Specify high-frequency details like...

  75. [75]

    Generate 10 questions that specifically probe that dimension as it relates to this prompt

  76. [76]

    Does the character’s face distort when they turn?

    Questions should cover: - Expected events and details mentioned in the prompt. - Potential failure modes (e.g., "Does the character’s face distort when they turn?"). - Success modes (e.g., "Is the reflection on the water consistent with the light source?"). - Adversarial probing (checking for subtle inconsistencies)

  77. [77]

    question

    For each question, define a 1-5 scoring rubric: - 1: Major failure / Completely incorrect. - 2: Notable artifacts / Significant issues. - 3: Mediocre / passable but flawed. - 4: Good / minor imperfections only. - 5: Perfect / Flawless execution. Return ONLY a JSON object where keys are the dimension names and values are lists of 10 question objects. Each ...

  78. [78]

    {question_1} Rubric: {rubric_description_1}

  79. [79]

    {question_2} Rubric: {rubric_description_2}

  80. [80]

    score": X,

    {question_10} Rubric: {rubric_description_10} Answer each question with a score (1-5) and a short justification. Return ONLY a JSON list of objects: [{"score": X, "justification": "..."}, ...] D. Case Studies This appendix presents two contrasting case studies. Prompt_1732 (§ D.1) is a clean example: both judges (Gemini/Claude) agree, rankings mirror the glo...

Showing first 80 references.