Recognition: 3 theorem links
WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
Pith reviewed 2026-05-08 19:02 UTC · model grok-4.3
The pith
A VLM using Likert-scale questionnaires on native-resolution frames reproduces human three-tier video quality rankings with perfect correlation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WorldJen shows that a VLM-as-a-judge system, supplied with prompt-specific Likert questionnaires covering up to sixteen quality dimensions simultaneously, matches the three-tier structure of human Bradley-Terry ratings derived from 2,696 pairwise annotations, achieving Spearman correlation of 1.000.
What carries the argument
The VLM-as-a-judge engine that scores videos via dimension-specific Likert questionnaires (ten questions each) at native resolution, validated against human-derived Bradley-Terry ratings.
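For intuition, the headline statistic is just a rank correlation over six models. A minimal sketch, with made-up numbers standing in for the human Bradley-Terry ratings and the VLM's mean Likert scores:

```python
# Hypothetical ratings for 6 models; only the ordering matters for the claim.
from scipy.stats import spearmanr

human_bt = [1712, 1698, 1540, 1531, 1305, 1290]   # three tiers: top / mid / bottom
vlm_mean = [4.6, 4.5, 3.8, 3.7, 2.9, 2.8]         # mean Likert scores per model

rho, p = spearmanr(human_bt, vlm_mean)
print(rho)  # 1.0 when the VLM ordering matches the human ordering exactly
# Note: scipy's default t-approximation degenerates at rho = 1; the paper's
# p = 0.0014 matches the exact one-sided value 1/6! ≈ 0.00139 for n = 6.
```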
If this is right
- Generative video models can be ranked on multiple quality dimensions at once without generating separate videos for each dimension.
- VLM judges become usable as scalable stand-ins for human raters once the three-tier agreement is confirmed.
- Evaluation no longer depends on low-resolution binary auditors that overlook temporal inconsistencies.
- The six ablation studies demonstrate that the Likert format and native-resolution input are both required for the observed agreement.
Where Pith is reading between the lines
- The approach could support continuous automated leaderboards that update as new video models appear.
- Extending the same prompt curation and questionnaire design to longer videos or 3D content would test whether the tier agreement holds beyond the current 50-prompt scope.
Load-bearing premise
The human preference study with 66.9 percent inter-annotator agreement on fifty prompts supplies a stable ground-truth three-tier ranking.
What would settle it
Running the same VLM judge on a fresh set of prompts or models yields Spearman correlation below 0.9 with new human Bradley-Terry ratings.
read the original abstract
Evaluating generative video models remains an open problem. Reference-based metrics such as Structural Similarity Index Measure (SSIM) and Peak Signal to Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Fréchet Video Distance (FVD) favors distributional textures over physical plausibility. Binary Visual Question Answering (VQA) based benchmarks like VBench 2.0 are prone to yes-bias and rely on low-resolution auditors that miss temporal failures. Moreover, their prompts target a single dimension at a time, multiplying the number of videos required while still not guaranteeing reliable results. WorldJen addresses these limitations directly. Binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are addressed by using adversarially curated prompts that are designed to exercise up to 16 quality dimensions simultaneously. The framework is built around two interlocking contributions. First, a blind human preference study is conducted, accumulating 2,696 pairwise annotations from 7 annotators with 100% pair coverage over 50 of the curated prompts $\times$ 6 state-of-the-art video models. A mean inter-annotator agreement of 66.9% is achieved and the study establishes a human ground-truth Bradley-Terry (BT) rating with a three-tier structure. Second, a VLM-as-a-judge evaluation engine using prompt-specific, dimension-specific Likert questionnaires (10 questions per dimension, 47,160 scored responses) judges the videos and reproduces the human-established three-tier BT rating structure independently. The VLM achieves a Spearman $\hat{\rho}=1.000,~p=0.0014$ that is interpreted as tier agreement with the human results. Six focused ablation studies validate the robustness of the VLM evaluation framework. Project page: https://moonmath.ai/worldjen/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WorldJen, a benchmark for generative video models that replaces binary VQA and reference-based metrics with a VLM-as-judge using prompt-specific Likert-scale questionnaires across 16 dimensions on adversarially curated prompts. It reports a human preference study with 2,696 pairwise annotations over 50 prompts and 6 models (66.9% mean inter-annotator agreement) that yields a Bradley-Terry model with a three-tier structure, and claims the VLM reproduces this exact tier structure with Spearman ρ̂=1.000 (p=0.0014) based on 47,160 scored responses, validated by six ablation studies.
Significance. If the validation holds, the work would be significant for the field by providing a scalable, multi-dimensional evaluation framework that aligns with human judgments on semantic and temporal aspects at native resolution, addressing documented weaknesses in FVD, SSIM, and binary VQA benchmarks. The independent human study and emphasis on simultaneous dimension coverage are clear strengths; however, the small scale of the human data limits the strength of claims about the reliability of the 16 individual dimensions.
major comments (3)
- [validation section] Human preference study (described in the validation section): The ground-truth three-tier BT structure is derived from only 6 models and 50 prompts with 66.9% inter-annotator agreement, producing a low-resolution, noisy signal. The reported perfect Spearman ρ̂=1.000 (p=0.0014) on this coarsened aggregate ranking does not establish that the VLM's 16 independent Likert scales reliably measure the intended dimensions rather than merely recovering the coarse ordering.
- [results section] VLM evaluation engine (results section): With only 6 models, the tier agreement test has very low statistical power; the p=0.0014 does not rule out that the VLM is capturing broad quality signals rather than the 16 distinct dimensions, especially given the absence of per-dimension human correlations or inter-dimension consistency checks.
- [section 6] Ablation studies (section 6): The six ablations are cited as validating robustness, but without quantitative details on how they isolate VLM prompt biases, Likert scale calibration, or sensitivity to the adversarial prompt curation, it is unclear whether they address the core concern of circularity between VLM judgments and the human tiers.
minor comments (2)
- [Abstract] The abstract would benefit from a brief statement on the exact VLM model, temperature settings, and prompt template used for the Likert questionnaires to improve reproducibility.
- [validation section] Notation for the Bradley-Terry model and Spearman correlation should include the explicit formula or reference to avoid ambiguity in how ties in the three-tier structure are handled.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important considerations regarding scale and validation strength. We address each major comment point by point below, clarifying our methodology and indicating revisions where they strengthen the manuscript without altering core claims.
read point-by-point responses
-
Referee: [validation section] Human preference study (described in the validation section): The ground-truth three-tier BT structure is derived from only 6 models and 50 prompts with 66.9% inter-annotator agreement, producing a low-resolution, noisy signal. The reported perfect Spearman ρ̂=1.000 (p=0.0014) on this coarsened aggregate ranking does not establish that the VLM's 16 independent Likert scales reliably measure the intended dimensions rather than merely recovering the coarse ordering.
Authors: We agree the human study scale (50 prompts, 6 models, 66.9% agreement) yields a coarse three-tier BT structure, which is a deliberate design choice to achieve full pairwise coverage with 2,696 annotations. The perfect Spearman correlation validates alignment at the aggregate tier level, which is the primary claim. However, we acknowledge this does not directly prove independence of the 16 Likert scales, as human data consists of overall preferences rather than dimension-specific ratings. We will revise the validation section to explicitly state this scope and add a limitations paragraph discussing the aggregate nature of the human ground truth. revision: partial
-
Referee: [results section] VLM evaluation engine (results section): With only 6 models, the tier agreement test has very low statistical power; the p=0.0014 does not rule out that the VLM is capturing broad quality signals rather than the 16 distinct dimensions, especially given the absence of per-dimension human correlations or inter-dimension consistency checks.
Authors: The small number of models limits statistical power for the tier test, and we will note this explicitly. The p=0.0014 reflects the exact ordering match on the coarsened tiers. To support distinct dimensions, the manuscript includes inter-dimension correlation analysis in the ablations (showing average pairwise correlations below 0.3 across dimensions), indicating they capture non-redundant signals. We cannot provide per-dimension human correlations, as the preference study collected holistic pairwise judgments. We will add this clarification and the inter-dimension results to the results section. revision: partial
-
Referee: [section 6] Ablation studies (section 6): The six ablations are cited as validating robustness, but without quantitative details on how they isolate VLM prompt biases, Likert scale calibration, or sensitivity to the adversarial prompt curation, it is unclear whether they address the core concern of circularity between VLM judgments and the human tiers.
Authors: We will expand Section 6 with quantitative ablation results, including: (1) correlation shifts when using non-adversarial prompts to isolate curation effects; (2) Likert scale sensitivity tests via rescaling experiments; and (3) bias checks via prompt perturbation. These demonstrate that VLM scores remain stable and independent of the human tier derivation process, addressing circularity concerns. The ablations were designed to test robustness without relying on the human data. revision: yes
- Remaining limitation acknowledged by the authors: The human preference study collected only overall pairwise preferences and does not include per-dimension ratings, so direct validation of each of the 16 VLM Likert scales against human judgments on individual dimensions cannot be performed without a new, larger study.
Circularity Check
No significant circularity; VLM validation uses independent human Bradley-Terry ground truth
full rationale
The paper's central derivation consists of two independent stages: (1) a human preference study collecting 2,696 pairwise annotations over 50 prompts and 6 models to fit a Bradley-Terry model and derive a three-tier structure, and (2) a separate VLM judge producing 47,160 Likert-scale responses that is then compared to the human tiers via Spearman correlation. The VLM outputs are not fitted to the human data, nor are any parameters or definitions circularly interdependent. No self-citations appear as load-bearing premises, no uniqueness theorems are imported from prior author work, and no ansatz or renaming reduces the claimed result to its inputs by construction. The human study functions as an external benchmark rather than a self-referential fit.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption A VLM can accurately grade video frames on Likert scales for multiple quality dimensions at native resolution
- standard math The Bradley-Terry model applied to pairwise human annotations yields a reliable three-tier ranking of video models
Lean theorems connected to this paper
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · no overlap · tagged unclear
unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: "Strengths are estimated via the Minorization-Maximization (MM) algorithm..." with ratings anchored as $\mathrm{BT\ rating}_i = 1500 + 400 \log_{10}(p_i / \bar{p}_{\mathrm{geom}})$
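The quoted fit and 1500-anchored rating map are straightforward to make concrete. A minimal sketch, assuming a hypothetical win-count matrix; the update rule follows Hunter's MM algorithm for Bradley-Terry models (reference [14] below):

```python
import numpy as np

def fit_bt_mm(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """MM fit of Bradley-Terry strengths; wins[i, j] = times model i beat j."""
    n = wins.shape[0]
    games = wins + wins.T                      # n_ij: comparisons per pair
    p = np.ones(n)
    for _ in range(iters):
        denom = np.array([sum(games[i, j] / (p[i] + p[j])
                              for j in range(n) if j != i) for i in range(n)])
        p = wins.sum(axis=1) / denom           # MM update: W_i / denom_i
        p /= p.sum()                           # fix the arbitrary scale
    return p

wins = np.array([[0, 30, 35], [20, 0, 28], [15, 22, 0]])  # toy 3-model example
p = fit_bt_mm(wins)
geom = np.exp(np.log(p).mean())                # geometric-mean anchor
print(1500 + 400 * np.log10(p / geom))         # BT rating_i as quoted above
```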
-
Cost (J-cost) · Jcost_unit0 · tagged unclear
unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: $\mathrm{PHAS}(m) = \frac{1}{|P|} \sum_{p} \frac{\sum_d w_d\, s_{m,p,d}}{\sum_d w_d}\, \lambda(m,p)$, where the weights $w_d$ are calibrated by non-negative ridge logistic regression on human preference annotations.
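Read literally, the quoted formula is a per-prompt weighted mean of dimension scores, scaled by a per-(model, prompt) factor λ and averaged over prompts. A minimal sketch, where the applicability mask and the handling of λ are assumptions:

```python
import numpy as np

def phas(scores, weights, lam, applicable):
    """scores: (P, D) mean Likert scores for one model; weights: (D,) w_d;
    lam: (P,) lambda(m, p); applicable: (P, D) mask of scored dimensions.
    Assumes every prompt has at least one applicable dimension."""
    w = np.where(applicable, weights, 0.0)            # drop null-suitability dims
    per_prompt = (w * scores).sum(axis=1) / w.sum(axis=1)
    return float((per_prompt * lam).mean())           # average over |P| prompts
```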
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
VideoPhy: Evaluating physical commonsense for video generation
Hritik Bansal, Zongyu Lee, Xinkai Ma, Vikram Li, Aditya Grover, Kai-Wei Chang, and Nanyun Peng. VideoPhy: Evaluating physical commonsense for video generation. In International Conference on Learning Representations (ICLR), 2025
2025
-
[2]
Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952
1952
-
[3]
Emerging properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021
2021
-
[4]
Chatbot Arena: An open platform for evaluating LLMs by human preference
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhuang, Zhanghao Wu, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Chatbot Arena: An open platform for evaluating LLMs by human preference. In ICML, 2024
2024
-
[5]
fal.ai: Fast inference for generative AI
fal.ai. fal.ai: Fast inference for generative AI. https://fal.ai, 2024
2024
-
[6]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
2023
-
[7]
Problems of monetary management: The UK experience
Charles A.E. Goodhart. Problems of monetary management: The UK experience. Papers in Monetary Economics, 1975
1975
-
[8]
Veo 3: State-of-the-art video generation
Google DeepMind. Veo 3: State-of-the-art video generation. https://deepmind.google/technologies/veo/, 2025
2025
-
[9]
LTX-Video: Realtime Video Latent Diffusion
Yoav HaCohen, Nisan Chiprut, Amit Brazowski, Neta Shaul, Omer Berman, Daniel Peleg, Idan Leshem, Uriel Singer, Dana Tamir, David Grabli, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2025
2025
-
[10]
Tag2Text: Guiding vision-language model via image tagging
Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2Text: Guiding vision-language model via image tagging. In ICLR, 2024
2024
-
[11]
VBench: Comprehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
2024
-
[12]
VBench++: Comprehensive and versatile benchmark suite for video generative models
Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. arXiv preprint arXiv:2411.13503, 2024
-
[13]
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
Ziqi Huang, Fan Zhang, Xiaojie Luo, Chenyang Si, Yinan He, et al. VBench-2.0: Advancing video generation benchmark with intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025
2025
-
[14]
MM algorithms for generalized Bradley-Terry models
David R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406, 2004
2004
-
[15]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024
2024
-
[16]
A guideline of selecting and reporting intraclass correlation coefficients for reliability research
Terry K. Koo and Mae Y. Li. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2):155–163, 2016
2016
-
[17]
Content Analysis: An Introduction to Its Methodology
Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage Publications, 4th edition, 2018
2018
-
[18]
Kling: A generative video foundation model
Kuaishou Technology. Kling: A generative video foundation model. https://kling.kuaishou.com, 2024
2024
-
[19]
The measurement of observer agreement for categorical data
J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977
1977
-
[20]
AlpacaEval: An automatic evaluator of instruction-following models
Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023
2023
-
[21]
WildBench: Benchmarking LLMs with challenging tasks from real users in the wild
Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Bhatt, Abhilasha Ravichander, et al. WildBench: Benchmarking LLMs with challenging tasks from real users in the wild. In ICLR, 2024
2024
-
[22]
EvalCrafter: Benchmarking and evaluating large video generation models
Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. EvalCrafter: Benchmarking and evaluating large video generation models. In CVPR, 2024
2024
-
[23]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021
2021
-
[24]
VQQA: An agentic approach for video evaluation and quality improvement
Yiwen Song, Tomas Pfister, and Yale Song. VQQA: An agentic approach for video evaluation and quality improvement. arXiv preprint arXiv:2603.12310, 2026
2026
-
[25]
T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation
Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
2025
-
[26]
RAFT: Recurrent all-pairs field transforms for optical flow
Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020
2020
-
[27]
Towards accurate generative models of video: A new metric & challenges
Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2019
2019
-
[28]
Wan: Open and Advanced Large-Scale Video Generative Models
Wan Team. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025
2025
-
[29]
A very big video reasoning suite
Maijunxian Wang, Zhongang Cai, et al. A Very Big Video Reasoning Suite. arXiv preprint arXiv:2602.20159, 2026. URL https://arxiv.org/abs/2602.20159
2026
-
[30]
VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models
Wenhao Wang and Yi Yang. VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[31]
Image quality assessment: From error visibility to structural similarity
Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004
2004
-
[32]
GRiT: A generative region-to-text transformer for object understanding
Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. GRiT: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022
2022
-
[33]
Jerrold H. Zar. Significance testing of the Spearman rank correlation coefficient. Journal of the American Statistical Association, 67(339):578–580, 1972
1972
-
[34]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018
2018
-
[35]
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023
2023
-
[36]
All 300 video files are stored on Google Drive; the Apps Script backend maps each (prompt_id, model) tuple to a Drive file ID
Confirm assets. VLM evaluation is complete for all 50 prompts × 6 models = 300 videos. All 300 video files are stored on Google Drive; the Apps Script backend maps each (prompt_id, model) tuple to a Drive file ID
-
[37]
Left/right assignment is randomised independently per pair per session
Pair generation. For each prompt, the interface automatically enumerates all $\binom{6}{2} = 15$ model pairings, yielding 750 total pairs. Left/right assignment is randomised independently per pair per session. Already-completed pairs (stored in the annotator’s Google Sheet) are filtered out on resume so no pair is shown twice to the same annotator
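A minimal sketch of this enumeration (the model names are hypothetical placeholders):

```python
import random
from itertools import combinations

models = ["veo3", "kling", "wan", "hunyuan", "ltx", "model6"]  # placeholders
pairs = list(combinations(models, 2))        # C(6, 2) = 15 unordered pairs
assert len(pairs) == 15                      # x 50 prompts = 750 total pairs
queue = [tuple(random.sample(pair, 2)) for pair in pairs]  # random left/right
```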
-
[38]
All Done
Session design. Each annotator’s queue contains all 750 pairs they have not yet personally judged, sorted by ascending global coverage (least-reviewed pairs first). A break overlay appears every 50 pairs. An annotator who completes all their remaining pairs sees an “All Done” screen
-
[39]
Video A” / “Video B
Access control. Annotators identify themselves by entering their email address at session start. The interface normalises emails to lowercase for consistent history lookup. Drive folder access is granted at the folder level so the script can serve video blobs. Note: the publicly released dataset uses anonymized annotator IDs (A1–A7) in place of email addresses
-
[40]
Read the prompt first. Base your decision on prompt faithfulness, not visual polish
-
[41]
Prioritise core action over background. Rank requirements mentally: Core Action / Physics → Characters → Background
-
[42]
Symmetric artifacts cancel. If both videos flicker or both clip, ignore that and judge what differs
-
[43]
barely better
Forced choice — no skips. Choose the more faithful attempt even if neither is perfect. Use Slightly better when the margin is thin. Decision tree: can you tell the videos apart on prompt faithfulness?
- One succeeds where the other clearly fails → Much better (e.g. core action done vs. not done; physics violated vs. correct)
- Noticeable gap, but both have s...
-
[44]
Annotators click Continuewhen ready
Session breaks. After every 50 pairs a break overlay appears with the pair count. Annotators click Continue when ready. Progress is saved continuously to Google Sheets, so closing the browser is safe
-
[45]
Loading... (slow connection)
Slow connections. If a video does not buffer within 5 seconds, a “Loading... (slow connection)” hint is shown. After 12 seconds the client automatically re-fetches a fresh copy of the video from Drive. Step 3 — Data Export and Aggregation
-
[46]
In the released anonymized dataset, timestamp is dropped and email is replaced by an opaque annotator ID (A1–A7)
Google Sheets format. Each vote is appended as one row with columns: timestamp, email, prompt_id, model_a, model_b, winner, loser, confidence, source. In the released anonymized dataset, timestamp is dropped and email is replaced by an opaque annotator ID (A1–A7)
-
[47]
Inter-annotator agreement. For pairs judged by ≥ 2 annotators, compute mean IAA and Krippendorff’s α
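One way to reproduce this step; the paper does not name its implementation, so the `krippendorff` PyPI package and the toy vote matrix below are assumptions:

```python
import numpy as np
import krippendorff

# Rows are annotators, columns are pairs; entries encode the chosen winner
# (0 = left model, 1 = right model), NaN where an annotator did not see the pair.
votes = np.array([[1.0, 0.0, 1.0, np.nan],
                  [1.0, 0.0, 0.0, 1.0],
                  [1.0, np.nan, 1.0, 1.0]])
alpha = krippendorff.alpha(reliability_data=votes,
                           level_of_measurement="nominal")
print(round(alpha, 3))
```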
-
[48]
Report 95% bootstrap CIs (1,000 resamples)
Human BT rating. Pool all comparisons and fit an unweighted Bradley-Terry model (each vote contributes one win/loss; confidence labels are used only in the PHAS step below) to obtain per-model Human BT rating anchored at 1500. Report 95% bootstrap CIs (1,000 resamples)
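A sketch of the bootstrap step, reusing the fit_bt_mm helper from the earlier sketch; `votes` is a hypothetical list of (winner_idx, loser_idx) pairs:

```python
import numpy as np

def bootstrap_bt_ci(votes, n_models, resamples=1000, seed=0):
    """95% percentile CI of the 1500-anchored BT rating per model."""
    rng = np.random.default_rng(seed)
    votes = np.asarray(votes)
    ratings = []
    for _ in range(resamples):
        sample = votes[rng.integers(0, len(votes), len(votes))]
        wins = np.zeros((n_models, n_models))
        for w, l in sample:
            wins[w, l] += 1               # in practice, smooth zero-win resamples
        p = fit_bt_mm(wins)
        ratings.append(1500 + 400 * np.log10(p / np.exp(np.log(p).mean())))
    return np.percentile(ratings, [2.5, 97.5], axis=0)  # (lo, hi) per model
```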
-
[49]
• Label $y \in \{0, 1\}$: $y = 1$ if the annotator preferred $m_A$, $y = 0$ otherwise; sample weight = confidence (Much/Clearly/Slightly → 3/2/1)
Calibrated PHAS weights. Using the 30-prompt calibration split (1,653 annotations; disjoint from the 20-prompt validation set), fit a non-negative constrained ridge logistic regression: • Feature vector $\mathbf{x} \in \mathbb{R}^{16}$: per-dimension VLM score difference $x_d = s_{m_A,p,d} - s_{m_B,p,d}$ for each applicable dimension; null-suitability dimensions are excluded (not zeroed) per ...
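A sketch of such a fit via box-constrained optimisation. The quoted passage specifies only non-negativity, ridge regularisation, and confidence-based sample weights; the ridge strength, the absence of an intercept, and the optimiser choice below are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def fit_nonneg_ridge_logistic(X, y, sw, ridge=1.0):
    """X: (N, 16) score differences x_d; y: (N,) 0/1 labels; sw: (N,) weights."""
    def loss(w):
        z = X @ w
        nll = np.sum(sw * (np.logaddexp(0.0, z) - y * z))  # weighted logistic NLL
        return nll + ridge * np.sum(w ** 2)                # ridge penalty
    res = minimize(loss, np.full(X.shape[1], 0.1), method="L-BFGS-B",
                   bounds=[(0.0, None)] * X.shape[1])      # enforce w_d >= 0
    return res.x
```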
-
[50]
Spearman ρ̂ vs. Human BT rating
Validation. Evaluate the calibrated weights on the held-out 20-prompt validation set (1,043 annotations): report pairwise prediction accuracy and PHAS model ranking (Spearman ρ̂ vs. Human BT rating). B. Prompt Curation B.1. Definitions Table 20 provides full definitions for all 16 evaluation dimensions. B.2. VidProM Filtering Pipeline The filtering pipeli...
-
[51]
NSFW/safety filter: VidProM’s built-in classifier removes sexually explicit, violent, and hateful content
-
[52]
3. Length filter: Prompts <30 characters or >500 characters are removed
Deduplication: Exact-hash deduplication followed by MinHash/LSH near-duplicate removal with a Jaccard threshold of 0.8. 3. Length filter: Prompts <30 characters or >500 characters are removed
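A sketch of the near-duplicate pass; the `datasketch` package and word-trigram shingling are assumptions (the paper does not name its tooling), but the 0.8 Jaccard threshold is from the text:

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """MinHash over word-trigram shingles of a prompt."""
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(max(1, len(words) - 2)):
        m.update(" ".join(words[i:i + 3]).encode("utf8"))
    return m

prompts = ["a red fox jumps over a frozen creek at dawn, cinematic lighting",
           "a red fox jumps over a frozen creek at dawn, cinematic"]  # toy inputs
lsh = MinHashLSH(threshold=0.8, num_perm=128)   # Jaccard threshold from the text
kept = []
for i, prompt in enumerate(prompts):            # assumes exact-hash dedup already ran
    m = minhash_of(prompt)
    if not lsh.query(m):                        # no near-duplicate already kept
        lsh.insert(str(i), m)
        kept.append(prompt)
```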
-
[53]
Prompts in the bottom quartile are discarded
Complexity score: An LLM-estimated score rewards prompts involving physics interactions, multi-subject scenes, temporal events, and spatial relationships. Prompts in the bottom quartile are discarded
-
[54]
Blacklist: Prompts containing URLs, political figures, named celebrities, or trademarked properties are flagged and removed
-
[55]
These stages retain approximately 5,000 prompts (∼0.3% of the original corpus)
Spam detection: Repetitive, malformed, or auto-generated prompts are removed via an n-gram-based classifier. These stages retain approximately 5,000 prompts (∼0.3% of the original corpus). Subsequent LLM judging further flags 276 (7.4%) for copyright/safety review, yielding the final set of 3,754 unique prompts. Table 20 | Complete dimension taxonomy used...
-
[56]
subject_consistency: Does the main character/object change shape, color, or identity during the video? - suitability: Does this prompt create conditions where subject inconsistency would be exposed? - difficulty: How hard for a video model to keep identity consistent?
-
[57]
scene_consistency: Does the environment stay stable or warp/melt? - suitability: Does this prompt expose scene warping during camera motion? - difficulty: How hard to keep scene stable during camera motion?
-
[58]
motion_smoothness: Does the video have stuttering or jitter? - suitability: Does the prompt expose frame skips? (fast/complex motion scores high) - difficulty: How hard for a model to render this motion smoothly? - NOTE: Rendering quality only — not physics
-
[59]
Intentional lighting changes are NOT flickering
temporal_flickering: Are there flashes or brightness artifacts? - suitability: Score high only for complex textures (water, hair, fire, smoke, fine patterns). Intentional lighting changes are NOT flickering. - difficulty: How hard to avoid unwanted flickering?
-
[60]
- difficulty: How hard to render physically accurate inertia? - NOTE: Physics (velocity changes) only — not rendering smoothness
inertial_consistency: Do objects follow laws of momentum? - suitability: Focus on velocity changes (falling, stopping, throwing, catching, sliding to a stop). - difficulty: How hard to render physically accurate inertia? - NOTE: Physics (velocity changes) only — not rendering smoothness. **Group B: Logic & Physics** Applicable if: prompt involves physical...
-
[61]
physical_mechanics: Do gravity, friction, and collisions look realistic?
-
[62]
object_permanence: If an object goes behind a wall, does it look the same when it reappears?
-
[63]
human_fidelity: Are humans rendered without alien artifacts (extra fingers, distorted faces, impossible body twisting)? Set to null if no humans in the prompt
-
[64]
dynamic_degree: Is there actual movement, or just a still image with zoom? **Group C: Instruction Adherence** Applicable if: prompt has specific objects, colors, spatial relationships, or precise requirements
-
[65]
semantic_adherence: Does the video contain exactly what was asked?
-
[66]
spatial_relationship: Are objects in the right relative positions?
-
[67]
semantic_drift: Does the AI start following the prompt but "forget" it halfway through? **Group D: Aesthetic Quality** Applicable if: prompt involves specific artistic styles, high-detail environments, or cinematic descriptions
-
[68]
composition_framing: Is the shot well-balanced?
-
[69]
lighting_volumetric: Is the lighting realistic with depth?
-
[70]
color_harmony: Are the colors pleasing and consistent?
-
[71]
- Difficulty: 1 = easy for model, 5 = moderate, 10 = extremely hard
structural_gestalt: Do elements look like they belong in the same world? Scoring guidelines: - Suitability: 1 = poor test, 5 = decent test, 10 = excellent/ideal. - Difficulty: 1 = easy for model, 5 = moderate, 10 = extremely hard. - Set scores to null for non-applicable dimensions. - Flag needs_review = true for harmful, policy-violating, or copyright-sen...
-
[72]
Fixing language: Correct grammar, spelling, improve coherence
-
[73]
Addressing weak dimensions: Add specific elements to boost weak dimensions (listed in the user message)
-
[74]
individual water droplets
Preserving core theme: Keep the main subject and concept EXACTLY as intended. Guidelines for weak dimensions (use specific, stress-testing details): - motion_smoothness: Add fast/complex motion (running, spinning, fast-moving objects). - temporal_flickering: Add complex textures (water, fire, hair, reflective surfaces). Specify high-frequency details like...
-
[75]
Generate 10 questions that specifically probe that dimension as it relates to this prompt
-
[76]
Does the character’s face distort when they turn?
Questions should cover: - Expected events and details mentioned in the prompt. - Potential failure modes (e.g., "Does the character’s face distort when they turn?"). - Success modes (e.g., "Is the reflection on the water consistent with the light source?"). - Adversarial probing (checking for subtle inconsistencies)
-
[77]
question
For each question, define a 1-5 scoring rubric: - 1: Major failure / Completely incorrect. - 2: Notable artifacts / Significant issues. - 3: Mediocre / passable but flawed. - 4: Good / minor imperfections only. - 5: Perfect / Flawless execution. Return ONLY a JSON object where keys are the dimension names and values are lists of 10 question objects. Each ...
-
[78]
{question_1} Rubric: {rubric_description_1}
-
[79]
{question_2} Rubric: {rubric_description_2}
-
[80]
score": X,
{question_10} Rubric: {rubric_description_10} Answer each question with a score (1-5) and a short justification. Return ONLY a JSON list of objects: [{"score": X, "justification": "..."}, ...] D. Case Studies This appendix presents two contrasting case studies.Prompt_1732(§ D.1) is aclean example: both judges (Gemini/Claude) agree, rankings mirror the glo...
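On the consuming side, each judge reply reduces to one score per (video, dimension) pair. A minimal sketch, assuming the JSON list format quoted above; averaging the ten answers is an assumption about the aggregation:

```python
import json

def dimension_score(reply_text: str) -> float:
    """Mean of the ten 1-5 answers for one (video, dimension) questionnaire."""
    answers = json.loads(reply_text)  # [{"score": 4, "justification": "..."}, ...]
    scores = [int(a["score"]) for a in answers]
    if len(scores) != 10 or not all(1 <= s <= 5 for s in scores):
        raise ValueError("reply does not match the 10-question, 1-5 rubric")
    return sum(scores) / len(scores)
```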
discussion (0)