pith. machine review for the scientific record.

arxiv: 2605.03475 · v2 · submitted 2026-05-05 · 💻 cs.CV

Recognition: 3 theorem links


WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords: generative video models · video quality benchmark · VLM as judge · human preference study · Bradley-Terry rating · Likert scale evaluation · multi-dimensional assessment · video generation metrics

The pith

A VLM using Likert-scale questionnaires on native-resolution frames reproduces human three-tier video quality rankings with perfect correlation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes WorldJen as an end-to-end benchmark that replaces pixel-based metrics and binary VQA with multi-dimensional Likert evaluations graded by a vision-language model. It first gathers human pairwise preferences on 50 adversarially curated prompts across six video models to produce Bradley-Terry ratings that form three clear tiers. The VLM judge, answering ten dimension-specific questions per prompt on full-resolution frames, independently recovers the identical tier ordering. This matters because current metrics either ignore semantic and physical correctness or require separate low-resolution audits for each quality aspect, driving up evaluation cost while missing temporal failures.
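As a concrete picture of the first stage, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise votes with the MM updates of Hunter (2004) cited by the paper, then converting them to the 1500 + 400·log10 rating scale quoted in the appendix excerpts later on this page. The vote counts are illustrative, not the paper's data, and this is not the paper's own fitting code.

```python
import math
from collections import defaultdict

def fit_bradley_terry(pairwise_wins, n_iter=200):
    """Fit Bradley-Terry strengths p_i from {(winner, loser): count}
    using Hunter's MM updates; strengths are normalised to mean 1."""
    models = sorted({m for pair in pairwise_wins for m in pair})
    p = {m: 1.0 for m in models}
    wins = defaultdict(float)      # total wins per model
    games = defaultdict(float)     # total comparisons per unordered pair
    for (w, l), c in pairwise_wins.items():
        wins[w] += c
        games[frozenset((w, l))] += c
    for _ in range(n_iter):
        new_p = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in models if j != i and frozenset((i, j)) in games
            )
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        mean = sum(new_p.values()) / len(new_p)
        p = {m: v / mean for m, v in new_p.items()}
    return p

def bt_rating(strengths):
    """Rating_i = 1500 + 400 * log10(p_i / geometric mean of all p),
    as in the appendix passage quoted further down this page."""
    geo = math.exp(sum(math.log(v) for v in strengths.values()) / len(strengths))
    return {m: 1500 + 400 * math.log10(v / geo) for m, v in strengths.items()}

# Toy example: 3 models, illustrative win counts (not the paper's data).
votes = {("A", "B"): 30, ("B", "A"): 20, ("A", "C"): 40,
         ("C", "A"): 10, ("B", "C"): 35, ("C", "B"): 15}
print(bt_rating(fit_bradley_terry(votes)))
```

Any positive rescaling of the strengths leaves the ordering, and hence the tiers, unchanged; only the geometric-mean anchor fixes the 1500 centre.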

Core claim

WorldJen shows that a VLM-as-a-judge system, supplied with prompt-specific Likert questionnaires covering up to sixteen quality dimensions simultaneously, matches the three-tier structure of human Bradley-Terry ratings derived from 2,696 pairwise annotations, achieving Spearman correlation of 1.000.
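The reported significance is consistent with an exact rank test on six models: under a null of random ordering, a perfect Spearman correlation occurs for exactly 1 of 6! = 720 permutations, i.e. p ≈ 0.0014. A small self-contained check, with illustrative ratings standing in for the paper's values:

```python
from itertools import permutations

def spearman_rho(x, y):
    """Spearman rank correlation for untied data (Pearson on ranks)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def exact_one_sided_p(x, y):
    """P(rho_perm >= rho_obs) over all n! reorderings of y."""
    rho_obs = spearman_rho(x, y)
    total = hits = 0
    for perm in permutations(y):
        total += 1
        if spearman_rho(x, list(perm)) >= rho_obs - 1e-12:
            hits += 1
    return rho_obs, hits / total

# Illustrative scores for six models (not the paper's values).
human_bt = [1710, 1650, 1540, 1480, 1390, 1230]
vlm_score = [4.4, 4.1, 3.6, 3.3, 2.9, 2.2]
print(exact_one_sided_p(human_bt, vlm_score))  # rho = 1.0, p = 1/720 ≈ 0.00139
```

With n = 6 the smallest attainable one-sided p-value is 1/720, so the reported 0.0014 is the floor of this test.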

What carries the argument

The VLM-as-a-judge engine that scores videos via dimension-specific Likert questionnaires (ten questions each) at native resolution, validated against human-derived Bradley-Terry ratings.
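A rough sketch of what such a scoring loop could look like, assuming a judge callable that returns one 1-to-5 Likert answer per question. The `ask_vlm` callable, the dimension names, and the unweighted aggregation are placeholders for illustration, not the paper's implementation (which calibrates per-dimension weights).

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

def score_video(
    frames,                                  # native-resolution frames for one video
    questionnaires: Dict[str, List[str]],    # dimension -> its 10 Likert questions
    ask_vlm: Callable[[object, str], int],   # placeholder judge: (frames, question) -> 1..5
) -> Dict[str, Optional[float]]:
    """Average the 1-5 Likert answers per dimension; None marks a dimension
    the questionnaire deems non-applicable for this prompt."""
    scores: Dict[str, Optional[float]] = {}
    for dim, questions in questionnaires.items():
        if not questions:                    # null / non-applicable dimension
            scores[dim] = None
            continue
        answers = [ask_vlm(frames, q) for q in questions]
        scores[dim] = mean(answers)
    return scores

def model_score(per_video_scores: List[Dict[str, Optional[float]]]) -> float:
    """Collapse per-dimension scores into one number per model: average over
    applicable dimensions, then over videos (an unweighted stand-in for the
    paper's calibrated aggregation)."""
    per_video = [
        mean(v for v in s.values() if v is not None) for s in per_video_scores
    ]
    return mean(per_video)

# Example with a dummy judge that always answers 4:
demo = score_video(frames=None,
                   questionnaires={"motion_smoothness": ["q"] * 10,
                                   "human_fidelity": []},
                   ask_vlm=lambda f, q: 4)
print(demo)   # motion_smoothness -> 4, human_fidelity -> None (not applicable)
```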

If this is right

  • Generative video models can be ranked on multiple quality dimensions at once without generating separate videos for each dimension.
  • VLM judges become usable as scalable stand-ins for human raters once the three-tier agreement is confirmed.
  • Evaluation no longer depends on low-resolution binary auditors that overlook temporal inconsistencies.
  • The six ablation studies demonstrate that the Likert format and native-resolution input are both required for the observed agreement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could support continuous automated leaderboards that update as new video models appear.
  • Extending the same prompt curation and questionnaire design to longer videos or 3D content would test whether the tier agreement holds beyond the current 50-prompt scope.

Load-bearing premise

The human preference study with 66.9 percent inter-annotator agreement on fifty prompts supplies a stable ground-truth three-tier ranking.
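A minimal sketch of how such an agreement figure can be computed, assuming (as the appendix excerpts below suggest) that agreement is taken over prompt-model pairs judged by at least two annotators; the vote layout and the tiny example are illustrative, not the paper's data.

```python
from collections import defaultdict
from itertools import combinations

def mean_pairwise_agreement(votes):
    """votes: list of (annotator, prompt_id, model_a, model_b, winner).
    For every (prompt, unordered model pair) seen by >= 2 annotators,
    compute the fraction of agreeing annotator pairs, then average."""
    by_item = defaultdict(dict)   # (prompt, frozenset(models)) -> {annotator: winner}
    for annotator, prompt, m_a, m_b, winner in votes:
        by_item[(prompt, frozenset((m_a, m_b)))][annotator] = winner
    rates = []
    for choices in by_item.values():
        if len(choices) < 2:
            continue
        pairs = list(combinations(choices.values(), 2))
        rates.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(rates) / len(rates) if rates else float("nan")

# Tiny illustrative example (three annotators, one prompt and model pair):
votes = [("A1", "p1", "M1", "M2", "M1"),
         ("A2", "p1", "M1", "M2", "M1"),
         ("A3", "p1", "M1", "M2", "M2")]
print(mean_pairwise_agreement(votes))   # 0.333...: one of three annotator pairs agrees
```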

What would settle it

Running the same VLM judge on a fresh set of prompts or models yields Spearman correlation below 0.9 with new human Bradley-Terry ratings.

read the original abstract

Evaluating generative video models remains an open problem. Reference-based metrics such as Structural Similarity Index Measure (SSIM) and Peak Signal to Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Frechet Video Distance (FVD) favors distributional textures over physical plausibility. Binary Visual Question Answering (VQA) based benchmarks like VBench~2.0 are prone to yes-bias and rely on low-resolution auditors that miss temporal failures. Moreover, their prompts target a single dimension at a time, multiplying the number of videos required while still not guaranteeing reliable results. WorldJen addresses these limitations directly. Binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are addressed by using adversarially curated prompts that are designed to exercise up to 16 quality dimensions simultaneously. The framework is built around two interlocking contributions. First, a blind human preference study is conducted, accumulating 2,696 pairwise annotations from 7 annotators with 100% pair coverage over 50 of the curated prompts $\times$ 6 state-of-the-art video models. A mean inter-annotator agreement of 66.9% is achieved and the study establishes a human ground-truth Bradley-Terry (BT) rating with a three-tier structure. Second, a VLM-as-a-judge evaluation engine using prompt-specific, dimension-specific Likert questionnaires (10 questions per dimension, 47,160 scored responses) judges the videos and reproduces the human-established three-tier BT rating structure independently. The VLM achieves a Spearman $\hat{\rho}=1.000,~p=0.0014$ that is interpreted as tier agreement with the human results. Six focused ablation studies validate the robustness of the VLM evaluation framework. Project page: https://moonmath.ai/worldjen/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces WorldJen, a benchmark for generative video models that replaces binary VQA and reference-based metrics with a VLM-as-judge using prompt-specific Likert-scale questionnaires across 16 dimensions on adversarially curated prompts. It reports a human preference study with 2,696 pairwise annotations over 50 prompts and 6 models (66.9% mean inter-annotator agreement) that yields a Bradley-Terry model with a three-tier structure, and claims the VLM reproduces this exact tier structure with Spearman ρ̂=1.000 (p=0.0014) based on 47,160 scored responses, validated by six ablation studies.

Significance. If the validation holds, the work would be significant for the field by providing a scalable, multi-dimensional evaluation framework that aligns with human judgments on semantic and temporal aspects at native resolution, addressing documented weaknesses in FVD, SSIM, and binary VQA benchmarks. The independent human study and emphasis on simultaneous dimension coverage are clear strengths; however, the small scale of the human data limits the strength of claims about the reliability of the 16 individual dimensions.

major comments (3)
  1. [validation section] Human preference study (described in the validation section): The ground-truth three-tier BT structure is derived from only 6 models and 50 prompts with 66.9% inter-annotator agreement, producing a low-resolution, noisy signal. The reported perfect Spearman ρ̂=1.000 (p=0.0014) on this coarsened aggregate ranking does not establish that the VLM's 16 independent Likert scales reliably measure the intended dimensions rather than merely recovering the coarse ordering.
  2. [results section] VLM evaluation engine (results section): With only 6 models, the tier agreement test has very low statistical power; the p=0.0014 does not rule out that the VLM is capturing broad quality signals rather than the 16 distinct dimensions, especially given the absence of per-dimension human correlations or inter-dimension consistency checks.
  3. [section 6] Ablation studies (section 6): The six ablations are cited as validating robustness, but without quantitative details on how they isolate VLM prompt biases, Likert scale calibration, or sensitivity to the adversarial prompt curation, it is unclear whether they address the core concern of circularity between VLM judgments and the human tiers.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief statement on the exact VLM model, temperature settings, and prompt template used for the Likert questionnaires to improve reproducibility.
  2. [validation section] Notation for the Bradley-Terry model and Spearman correlation should include the explicit formula or reference to avoid ambiguity in how ties in the three-tier structure are handled.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments, which highlight important considerations regarding scale and validation strength. We address each major comment point by point below, clarifying our methodology and indicating revisions where they strengthen the manuscript without altering core claims.

read point-by-point responses
  1. Referee: [validation section] Human preference study (described in the validation section): The ground-truth three-tier BT structure is derived from only 6 models and 50 prompts with 66.9% inter-annotator agreement, producing a low-resolution, noisy signal. The reported perfect Spearman ρ̂=1.000 (p=0.0014) on this coarsened aggregate ranking does not establish that the VLM's 16 independent Likert scales reliably measure the intended dimensions rather than merely recovering the coarse ordering.

    Authors: We agree the human study scale (50 prompts, 6 models, 66.9% agreement) yields a coarse three-tier BT structure, which is a deliberate design choice to achieve full pairwise coverage with 2,696 annotations. The perfect Spearman correlation validates alignment at the aggregate tier level, which is the primary claim. However, we acknowledge this does not directly prove independence of the 16 Likert scales, as human data consists of overall preferences rather than dimension-specific ratings. We will revise the validation section to explicitly state this scope and add a limitations paragraph discussing the aggregate nature of the human ground truth. revision: partial

  2. Referee: [results section] VLM evaluation engine (results section): With only 6 models, the tier agreement test has very low statistical power; the p=0.0014 does not rule out that the VLM is capturing broad quality signals rather than the 16 distinct dimensions, especially given the absence of per-dimension human correlations or inter-dimension consistency checks.

    Authors: The small number of models limits statistical power for the tier test, and we will note this explicitly. The p=0.0014 reflects the exact ordering match on the coarsened tiers. To support distinct dimensions, the manuscript includes inter-dimension correlation analysis in the ablations (showing average pairwise correlations below 0.3 across dimensions), indicating they capture non-redundant signals. We cannot provide per-dimension human correlations, as the preference study collected holistic pairwise judgments. We will add this clarification and the inter-dimension results to the results section. revision: partial

  3. Referee: [section 6] Ablation studies (section 6): The six ablations are cited as validating robustness, but without quantitative details on how they isolate VLM prompt biases, Likert scale calibration, or sensitivity to the adversarial prompt curation, it is unclear whether they address the core concern of circularity between VLM judgments and the human tiers.

    Authors: We will expand Section 6 with quantitative ablation results, including: (1) correlation shifts when using non-adversarial prompts to isolate curation effects; (2) Likert scale sensitivity tests via rescaling experiments; and (3) bias checks via prompt perturbation. These demonstrate that VLM scores remain stable and independent of the human tier derivation process, addressing circularity concerns. The ablations were designed to test robustness without relying on the human data. revision: yes

standing simulated objections not resolved
  • The human preference study collected only overall pairwise preferences and does not include per-dimension ratings, so direct validation of each of the 16 VLM Likert scales against human judgments on individual dimensions cannot be performed without a new, larger study.

Circularity Check

0 steps flagged

No significant circularity; VLM validation uses independent human Bradley-Terry ground truth

full rationale

The paper's central derivation consists of two independent stages: (1) a human preference study collecting 2,696 pairwise annotations over 50 prompts and 6 models to fit a Bradley-Terry model and derive a three-tier structure, and (2) a separate VLM judge producing 47,160 Likert-scale responses that is then compared to the human tiers via Spearman correlation. The VLM outputs are not fitted to the human data, nor are any parameters or definitions circularly interdependent. No self-citations appear as load-bearing premises, no uniqueness theorems are imported from prior author work, and no ansatz or renaming reduces the claimed result to its inputs by construction. The human study functions as an external benchmark rather than a self-referential fit.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No free parameters or invented entities; relies on domain assumptions about VLM judgment capabilities and standard statistical models for preference aggregation.

axioms (2)
  • domain assumption A VLM can accurately grade video frames on Likert scales for multiple quality dimensions at native resolution
    This underpins the VLM-as-a-judge engine replacing binary VQA.
  • standard math The Bradley-Terry model applied to pairwise human annotations yields a reliable three-tier ranking of video models
    Used to establish human ground-truth.

pith-pipeline@v0.9.0 · 5648 in / 1184 out tokens · 88848 ms · 2026-05-08T19:02:31.142686+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation (no overlap) washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    Strengths are estimated via the Minorization-Maximization (MM) algorithm... BT rating_i = 1500 + 400 log_10(p_i / p̄_geom)

  • Cost (J-cost) Jcost_unit0 unclear

    Relation between the paper passage and the cited Recognition theorem.

    PHAS(m) = (1/|P|) Σ_p [ (Σ_d w_d s_{m,p,d}) / (Σ_d w_d) ] · λ(m,p), where w_d are calibrated by non-negative ridge logistic regression on human preference annotations.
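The quoted PHAS formula is a suitability-weighted average of per-dimension Likert scores, averaged over prompts. A compact sketch under that reading, with made-up weights and scores in place of the paper's calibrated values (the λ(m,p) factor is taken as given per video, and non-applicable dimensions are simply absent, mirroring the "excluded, not zeroed" rule quoted below):

```python
def phas(model, prompts, scores, weights, lam):
    """PHAS(m) = (1/|P|) * sum_p [ sum_d w_d * s_{m,p,d} / sum_d w_d ] * lambda(m,p).
    scores[(m, p)] maps each applicable dimension to its Likert score."""
    total = 0.0
    for p in prompts:
        dims = scores[(model, p)]
        w_sum = sum(weights[d] for d in dims)
        weighted = sum(weights[d] * s for d, s in dims.items())
        total += (weighted / w_sum) * lam[(model, p)]
    return total / len(prompts)

# Illustrative call (two prompts, three dimensions, made-up numbers):
weights = {"semantic_adherence": 0.6, "motion_smoothness": 0.3, "human_fidelity": 0.4}
scores = {("M1", "p1"): {"semantic_adherence": 4.2, "motion_smoothness": 3.8},
          ("M1", "p2"): {"semantic_adherence": 3.5, "human_fidelity": 4.0}}
lam = {("M1", "p1"): 1.0, ("M1", "p2"): 0.9}
print(phas("M1", ["p1", "p2"], scores, weights, lam))
```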

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

98 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    VideoPhy: Evaluating physical commonsense for video generation

    Hritik Bansal, Zongyu Lee, Xinkai Ma, Vikram Li, Aditya Grover, Kai-Wei Chang, and Nanyun Peng. VideoPhy: Evaluating physical commonsense for video generation. In International Conference on Learning Representations (ICLR), 2025

  2. [2]

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952

  3. [3]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021

  4. [4]

    Chatbot arena: An open platform for evaluating LLMs by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhuang, Zhanghao Wu, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Chatbot arena: An open platform for evaluating LLMs by human preference. In ICML, 2024

  5. [5]

    fal.ai: Fast inference for generative AI. https://fal.ai, 2024

    fal.ai. fal.ai: Fast inference for generative AI. https://fal.ai, 2024

  6. [6]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  7. [7]

    Problems of monetary management: The UK experience

    Charles A.E. Goodhart. Problems of monetary management: The UK experience. Papers in Monetary Economics, 1975

  8. [8]

    Veo 3: State-of-the-art video generation

    Google DeepMind. Veo 3: State-of-the-art video generation. https://deepmind.google/technologies/veo/, 2025

  9. [9]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Amit Brazowski, Neta Shaul, Omer Berman, Daniel Peleg, Idan Leshem, Uriel Singer, Dana Tamir, David Grabli, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2025

  10. [10]

    Tag2Text: Guiding vision-language model via image tagging

    Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2Text: Guiding vision-language model via image tagging. In ICLR, 2024

  11. [11]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognit...

  12. [12]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. arXiv preprint arXiv:2411.13503, 2024

  13. [13]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Ziqi Huang, Fan Zhang, Xiaojie Luo, Chenyang Si, Yinan He, et al. VBench-2.0: Advancing video generation benchmark with intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025

  14. [14]

    MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406, 2004

    David R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406, 2004

  15. [15]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  16. [16]

    A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2):155–163, 2016

    Terry K. Koo and Mae Y. Li. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2):155–163, 2016

  17. [17]

    Content Analysis: An Introduction to Its Methodology

    Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage Publications, 4th edition, 2018

  18. [18]

    Kling: A generative video foundation model

    Kuaishou Technology. Kling: A generative video foundation model. https://kling.kuaishou.com, 2024

  19. [19]

    The measurement of observer agreement for categorical data

    J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977

  20. [20]

    AlpacaEval: An automatic evaluator of instruction-following models

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023

  21. [21]

    WildBench: Benchmarking LLMs with challenging tasks from real users in the wild

    Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Bhatt, Abhilasha Ravichander, et al. WildBench: Benchmarking LLMs with challenging tasks from real users in the wild. In ICLR, 2024

  22. [22]

    EvalCrafter: Benchmarking and evaluating large video generation models

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. EvalCrafter: Benchmarking and evaluating large video generation models. In CVPR, 2024

  23. [23]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

  24. [24]

    VQQA: An agentic approach for video evaluation and quality improvement. arXiv preprint arXiv:2603.12310, 2026

    Yiwen Song, Tomas Pfister, and Yale Song. VQQA: An agentic approach for video evaluation and quality improvement. arXiv preprint arXiv:2603.12310, 2026

  25. [25]

    T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation

    Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  26. [26]

    RAFT: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020

  27. [27]

    Towards accurate generative models of video: A new metric & challenges

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2019

  28. [28]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan Team. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  29. [29]

    A very big video reasoning suite

    Maijunxian Wang, Zhongang Cai, et al. A Very Big Video Reasoning Suite. arXiv preprint arXiv:2602.20159, 2026. URL https://arxiv.org/abs/2602.20159

  30. [30]

    VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models

    Wenhao Wang and Yi Yang. VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  31. [31]

    Image quality assessment: From error visibility to structural similarity

    Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004

  32. [32]

    Grit: A generative region-to-text transformer for object understanding

    Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. GRiT: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022

  33. [33]

    Jerrold H. Zar. Significance testing of the Spearman rank correlation coefficient. Journal of the American Statistical Association, 67(339):578–580, 1972

  34. [34]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

  35. [35]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023

  36. [36]

    All 300 video files are stored on Google Drive; the Apps Script backend maps each (prompt_id, model) tuple to a Drive file ID

    Confirm assets. VLM evaluation is complete for all 50 prompts × 6 models = 300 videos. All 300 video files are stored on Google Drive; the Apps Script backend maps each (prompt_id, model) tuple to a Drive file ID

  37. [37]

    Left/right assignment is randomised independently per pair per session

    Pair generation. For each prompt, the interface automatically enumerates all (6 choose 2) = 15 model pairings, yielding 750 total pairs. Left/right assignment is randomised independently per pair per session. Already-completed pairs (stored in the annotator’s Google Sheet) are filtered out on resume so no pair is shown twice to the same annotator

  38. [38]

    All Done

    Session design. Each annotator’s queue contains all 750 pairs they have not yet personally judged, sorted by ascending global coverage (least-reviewed pairs first). A break overlay appears every 50 pairs. An annotator who completes all their remaining pairs sees an “All Done” screen

  39. [39]

    Video A” / “Video B

    Access control. Annotators identify themselves by entering their email address at session start. The interface normalises emails to lowercase for consistent history lookup. Drive folder access is granted at the folder level so the script can serve video blobs. Note: the publicly released dataset uses anonymized annotator IDs (A1–A7) in place of email addres...

  40. [40]

    Read the prompt first. Base your decision on prompt-faithfulness, not visual polish

  41. [41]

    Prioritise core action over background. Rank requirements mentally: Core Action / Physics → Characters → Background

  42. [42]

    Symmetric artifacts cancel. If both videos flicker or both clip, ignore that and judge what differs

  43. [43]

    barely better

    Forced choice — no skips. Choose the more faithful attempt even if neither is perfect. Use Slightly better when the margin is thin. Decision Tree: Can you tell the videos apart on prompt faithfulness? | ⊢ One succeeds where the other clearly fails → Much better (e.g. core action done vs. not done; physics violated vs. correct) | ⊢ Noticeable gap, but both have s...

  44. [44]

    Annotators click Continue when ready

    Session breaks. After every 50 pairs a break overlay appears with the pair count. Annotators click Continue when ready. Progress is saved continuously to Google Sheets, so closing the browser is safe

  45. [45]

    Loading... (slow connection)

    Slow connections. If a video does not buffer within 5 seconds, a “Loading... (slow connection)” hint is shown. After 12 seconds the client automatically re-fetches a fresh copy of the video from Drive. Step 3 — Data Export and Aggregation

  46. [46]

    In the released anonymized dataset, timestamp is dropped and email is replaced by an opaque annotator ID (A1–A7)

    Google Sheets format. Each vote is appended as one row with columns: timestamp, email, prompt_id, model_a, model_b, winner, loser, confidence, source. In the released anonymized dataset, timestamp is dropped and email is replaced by an opaque annotator ID (A1–A7)

  47. [47]

    Inter-annotator agreement. For pairs judged by ≥ 2 annotators, compute mean IAA and Krippendorff’s α

  48. [48]

    Report 95% bootstrap CIs (1,000 resamples)

    Human BT rating. Pool all comparisons and fit an unweighted Bradley-Terry model (each vote contributes one win/loss; confidence labels are used only in the PHAS step below) to obtain per-model Human BT rating anchored at 1500. Report 95% bootstrap CIs (1,000 resamples)

  49. [49]

    • Label y ∈ {0, 1}: y = 1 if the annotator preferred m_A, y = 0 otherwise; sample weight = confidence (Much/Clearly/Slightly → 3/2/1)

    Calibrated PHAS weights. Using the 30-prompt calibration split (1,653 annotations; disjoint from the 20-prompt validation set), fit a non-negative constrained ridge logistic regression: • Feature vector x ∈ R^16: per-dimension VLM score difference x_d = s_{m_A,p,d} − s_{m_B,p,d} for each applicable dimension; null-suitability dimensions are excluded (not zeroed) per ...

  50. [50]

    Human BT rating)

    Validation. Evaluate the calibrated weights on the held-out 20-prompt validation set (1,043 annotations): report pairwise prediction accuracy and PHAS model ranking (Spearman ρ̂ vs. Human BT rating). B. Prompt Curation B.1. Definitions Table 20 provides full definitions for all 16 evaluation dimensions. B.2. VidProM Filtering Pipeline The filtering pipeli...

  51. [51]

    NSFW/safety filter: VidProM’s built-in classifier removes sexually explicit, violent, and hateful content

  52. [52]

    3. Length filter: Prompts <30 characters or >500 characters are removed

    Deduplication: Exact-hash deduplication followed by MinHash/LSH near-duplicate removal with a Jaccard threshold of 0.8. 3. Length filter: Prompts <30 characters or >500 characters are removed

  53. [53]

    Prompts in the bottom quartile are discarded

    Complexity score: An LLM-estimated score rewards prompts involving physics interactions, multi-subject scenes, temporal events, and spatial relationships. Prompts in the bottom quartile are discarded

  54. [54]

    Blacklist: Prompts containing URLs, political figures, named celebrities, or trademarked properties are flagged and removed

  55. [55]

    These stages retain approximately 5,000 prompts (∼0.3% of the original corpus)

    Spam detection: Repetitive, malformed, or auto-generated prompts are removed via an n-gram-based classifier. These stages retain approximately 5,000 prompts (∼0.3% of the original corpus). Subsequent LLM judging further flags 276 (7.4%) for copyright/safety review, yielding the final set of 3,754 unique prompts. Table 20 | Complete dimension taxonomy used...

  56. [56]

    subject_consistency: Does the main character/object change shape, color, or identity during the video? - suitability: Does this prompt create conditions where subject inconsistency would be exposed? - difficulty: How hard for a video model to keep identity consistent?

  57. [57]

    scene_consistency: Does the environment stay stable or warp/melt? - suitability: Does this prompt expose scene warping during camera motion? - difficulty: How hard to keep scene stable during camera motion?

  58. [58]

    motion_smoothness: Does the video have stuttering or jitter? - suitability: Does the prompt expose frame skips? (fast/complex motion scores high) - difficulty: How hard for a model to render this motion smoothly? - NOTE: Rendering quality only — not physics

  59. [59]

    Intentional lighting changes are NOT flickering

    temporal_flickering: Are there flashes or brightness artifacts? - suitability: Score high only for complex textures (water, hair, fire, smoke, fine patterns). Intentional lighting changes are NOT flickering. - difficulty: How hard to avoid unwanted flickering?

  60. [60]

    - difficulty: How hard to render physically accurate inertia? - NOTE: Physics (velocity changes) only — not rendering smoothness

    inertial_consistency: Do objects follow laws of momentum? - suitability: Focus on velocity changes (falling, stopping, throwing, catching, sliding to a stop). - difficulty: How hard to render physically accurate inertia? - NOTE: Physics (velocity changes) only — not rendering smoothness. **Group B: Logic & Physics** Applicable if: prompt involves physical...

  61. [61]

    physical_mechanics: Do gravity, friction, and collisions look realistic?

  62. [62]

    object_permanence: If an object goes behind a wall, does it look the same when it reappears?

  63. [63]

    human_fidelity: Are humans rendered without alien artifacts (extra fingers, distorted faces, impossible body twisting)? Set to null if no humans in the prompt

  64. [64]

    dynamic_degree: Is there actual movement, or just a still image with zoom? **Group C: Instruction Adherence** Applicable if: prompt has specific objects, colors, spatial relationships, or precise requirements

  65. [65]

    semantic_adherence: Does the video contain exactly what was asked?

  66. [66]

    spatial_relationship: Are objects in the right relative positions?

  67. [67]

    semantic_drift: Does the AI start following the prompt but "forget" it halfway through? **Group D: Aesthetic Quality** Applicable if: prompt involves specific artistic styles, high-detail environments, or cinematic descriptions

  68. [68]

    composition_framing: Is the shot well-balanced?

  69. [69]

    lighting_volumetric: Is the lighting realistic with depth?

  70. [70]

    color_harmony: Are the colors pleasing and consistent?

  71. [71]

    - Difficulty: 1 = easy for model, 5 = moderate, 10 = extremely hard

    structural_gestalt: Do elements look like they belong in the same world? Scoring guidelines: - Suitability: 1 = poor test, 5 = decent test, 10 = excellent/ideal. - Difficulty: 1 = easy for model, 5 = moderate, 10 = extremely hard. - Set scores to null for non-applicable dimensions. - Flag needs_review = true for harmful, policy-violating, or copyright-sen...

  72. [72]

    Fixing language: Correct grammar, spelling, improve coherence

  73. [73]

    Addressing weak dimensions: Add specific elements to boost weak dimensions (listed in the user message)

  74. [74]

    individual water droplets

    Preserving core theme: Keep the main subject and concept EXACTLY as intended. Guidelines for weak dimensions (use specific, stress-testing details): - motion_smoothness: Add fast/complex motion (running, spinning, fast-moving objects). - temporal_flickering: Add complex textures (water, fire, hair, reflective surfaces). Specify high-frequency details like...

  75. [75]

    Generate 10 questions that specifically probe that dimension as it relates to this prompt

  76. [76]

    Does the character’s face distort when they turn?

    Questions should cover: - Expected events and details mentioned in the prompt. - Potential failure modes (e.g., "Does the character’s face distort when they turn?"). - Success modes (e.g., "Is the reflection on the water consistent with the light source?"). - Adversarial probing (checking for subtle inconsistencies)

  77. [77]

    question

    For each question, define a 1-5 scoring rubric: - 1: Major failure / Completely incorrect. - 2: Notable artifacts / Significant issues. - 3: Mediocre / passable but flawed. - 4: Good / minor imperfections only. - 5: Perfect / Flawless execution. Return ONLY a JSON object where keys are the dimension names and values are lists of 10 question objects. Each ...

  78. [78]

    {question_1} Rubric: {rubric_description_1}

  79. [79]

    {question_2} Rubric: {rubric_description_2}

  80. [80]

    score": X,

    {question_10} Rubric: {rubric_description_10} Answer each question with a score (1-5) and a short justification. Return ONLY a JSON list of objects: [{"score": X, "justification": "..."}, ...] D. Case Studies This appendix presents two contrasting case studies. Prompt_1732 (§ D.1) is a clean example: both judges (Gemini/Claude) agree, rankings mirror the glo...

Showing first 80 references.