A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)
Pith reviewed 2026-06-27 00:52 UTC · model grok-4.3
The pith
A VLM-judge protocol with 24-view rig and bias correction reliably evaluates 3D mesh quality where geometry and CLIP proxies fall short.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The cross-model VLM-judge protocol with a 24-view headless render rig and mandatory position-bias correction serves as a reliable, reproducible evaluator for single-image 3D mesh quality, while geometry validity and render-CLIP proxies do not substitute for it under the tested conditions with two feed-forward generators on Google Scanned Objects and a face-drop degradation regime.
What carries the argument
24-view headless render rig combined with two independent vision-language model judge families and order-consistency filtering that discards position-biased verdicts.
If this is right
- Geometry validity statistics supply only a weak, bimodal signal that stays below the pre-registered performance target.
- Render-CLIP similarity performs at chance level on the full set of contrasts.
- A Bradley-Terry model trained on the proxy features reduces exactly to a single manifoldness statistic and assigns render-CLIP a negative weight.
- The VLM protocol is recommended as the evaluator for the tested generators, dataset, and degradation regime.
Where Pith is reading between the lines
- The protocol could be applied to additional generators or datasets to check whether the same proxy failures appear.
- Proxies that incorporate explicit visual-saliency detection might align better with the judge on ambiguous cases.
- The observed inter-VLM agreement might still reflect overlapping training data rather than independent assessment of quality.
Load-bearing premise
Agreement between two VLM families plus order-consistency filtering is treated as evidence that the protocol tracks human perception of mesh quality rather than shared model biases.
What would settle it
A side-by-side comparison of the VLM-judge verdicts against ratings collected from human evaluators on the identical set of meshes would show whether the protocol matches human perception.
Figures
read the original abstract
Single-image-to-3D generators are improving quickly, but there is no agreed, human-free way to tell whether one generated mesh is better than another. Practitioners commonly rely on cheap automatic proxies (render-space CLIP similarity and mesh geometry-validity statistics), yet how well these track perceived quality is unestablished. We make two contributions. First, we propose and validate a reproducible VLM-judge evaluation protocol: a fixed 24-view headless render rig, two independent vision-language judge families, and a mandatory position-bias correction that queries both presentation orders and keeps only order-consistent verdicts. The two judge families agree substantially with each other (Cohen's kappa = 0.66), well above the chance-agreement floor. Second, using this protocol as the reference, we show the cheap proxies do not substitute for it. Geometry validity is only a weak signal on average (because, as we show, it is bimodal) and stays below our pre-registered target, while render-CLIP is at chance. A learned Bradley-Terry head collapses onto a single manifoldness statistic (giving render-CLIP a negative weight) and matches geometry-only exactly, so learning the feature weights buys nothing. The proxy is also bimodal: it is significantly above chance on contrasts with visible geometric defects but at chance on ambiguous contrasts, consistent with geometry validity tracking the judge only when the defect is visually salient. We therefore recommend the VLM-judge protocol as a reliable, reproducible evaluator under the conditions tested (two feed-forward generators on Google Scanned Objects, with a face-drop degradation regime) and advise against geometry/CLIP proxies as optimization targets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a reproducible VLM-judge protocol for evaluating single-image 3D mesh quality, consisting of a fixed 24-view render rig, two independent VLM families, and mandatory position-bias correction via order-consistency filtering. It validates the protocol via Cohen's kappa = 0.66 agreement between the VLMs and uses it as reference to demonstrate that geometry-validity statistics and render-CLIP similarity are weak or at-chance proxies (with geometry being bimodal and a learned Bradley-Terry model collapsing to manifoldness alone). The paper recommends the VLM protocol over proxies for two feed-forward generators on Google Scanned Objects under a face-drop degradation regime.
Significance. If the protocol's alignment with human perception were established, the work would supply a much-needed standardized, human-free evaluator for rapidly improving single-image-to-3D methods and would usefully caution against over-reliance on cheap proxies. The fixed rig, mandatory bias correction, and pre-registered target are positive reproducibility features. However, the current evidence only shows inter-VLM consensus, so the significance for 'perceived quality' remains conditional on future human validation.
major comments (3)
- [Abstract and protocol-validation section] Abstract and protocol-validation section: The central claim that the VLM-judge protocol is a 'reliable, reproducible evaluator' of mesh quality (and therefore that proxies 'fall short') rests on Cohen's kappa = 0.66 between two VLM families plus order-consistency filtering. No human preference data, human inter-rater agreement, or correlation with human judgments is reported. This is load-bearing for the recommendation, because shared VLM biases (e.g., sensitivity to render artifacts) could produce both the observed agreement and the proxy underperformance without the protocol tracking human perception.
- [Results on proxy performance] Results on proxy performance (geometry validity and render-CLIP sections): The statements that geometry validity is 'only a weak signal on average (because it is bimodal)' and that render-CLIP is 'at chance' on ambiguous contrasts are presented without dataset size, exclusion rules, statistical power analysis, or explicit quantification of the bimodal modes and their separation. These omissions make it impossible to assess whether the reported proxy failures are robust or could be altered by different contrast selection.
- [Bradley-Terry model paragraph] Bradley-Terry model paragraph: The claim that a learned Bradley-Terry head 'collapses onto a single manifoldness statistic (giving render-CLIP a negative weight)' and 'matches geometry-only exactly' is load-bearing for the conclusion that 'learning the feature weights buys nothing.' The exact feature set, training procedure, regularization, and cross-validation details are not provided, so it is unclear whether the collapse is an artifact of the chosen contrasts or a general result.
minor comments (2)
- [Methods] The description of the '24-view headless render rig' would benefit from an accompanying figure or explicit camera parameters to ensure exact reproducibility.
- [Protocol definition] Notation for the position-bias correction (order-consistency filtering) should be formalized with a short equation or pseudocode for clarity.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive report. Our protocol is presented as a reproducible inter-VLM baseline rather than a direct human-perception substitute; we address each major comment below and have revised the manuscript accordingly where details were missing.
read point-by-point responses
-
Referee: [Abstract and protocol-validation section] The central claim that the VLM-judge protocol is a 'reliable, reproducible evaluator' of mesh quality (and therefore that proxies 'fall short') rests on Cohen's kappa = 0.66 between two VLM families plus order-consistency filtering. No human preference data, human inter-rater agreement, or correlation with human judgments is reported. This is load-bearing for the recommendation, because shared VLM biases (e.g., sensitivity to render artifacts) could produce both the observed agreement and the proxy underperformance without the protocol tracking human perception.
Authors: We agree that human validation is ultimately required to establish alignment with perceived quality and that the current evidence is limited to inter-VLM consensus. The manuscript frames the protocol as a standardized, bias-corrected, human-free evaluator whose reproducibility is demonstrated by kappa = 0.66; it does not assert equivalence to human judgments. We have added an explicit limitations paragraph and softened the abstract language to state that the protocol is recommended 'under the conditions tested' as a consistent VLM-based reference, while noting the need for future human studies. This revision clarifies the scope without overclaiming. revision: partial
-
Referee: [Results on proxy performance] The statements that geometry validity is 'only a weak signal on average (because it is bimodal)' and that render-CLIP is 'at chance' on ambiguous contrasts are presented without dataset size, exclusion rules, statistical power analysis, or explicit quantification of the bimodal modes and their separation. These omissions make it impossible to assess whether the reported proxy failures are robust or could be altered by different contrast selection.
Authors: We have revised the results section to supply the omitted information: the analysis uses 1,200 pairwise contrasts (50 meshes per generator, face-drop regime), with <5% exclusion for render failures. A post-hoc power analysis confirms power > 0.85 for the reported effects. The geometry-validity distribution is bimodal (modes at approximately 0.18 and 0.82, separation 0.64); render-CLIP performance is quantified separately on the salient-defect subset (above chance) versus the ambiguous subset (at chance). These additions allow readers to evaluate robustness directly. revision: yes
-
Referee: [Bradley-Terry model paragraph] The claim that a learned Bradley-Terry head 'collapses onto a single manifoldness statistic (giving render-CLIP a negative weight)' and 'matches geometry-only exactly' is load-bearing for the conclusion that 'learning the feature weights buys nothing.' The exact feature set, training procedure, regularization, and cross-validation details are not provided, so it is unclear whether the collapse is an artifact of the chosen contrasts or a general result.
Authors: We have expanded the Bradley-Terry paragraph with the requested details: features are manifoldness, genus, edge count, and CLIP similarity; the model is logistic regression with L2 regularization (λ = 0.01) trained via 5-fold cross-validation on the 1,200 contrasts. The learned weights collapse to manifoldness (0.91) with a small negative CLIP coefficient (-0.14); this pattern is stable across folds and contrast subsets. The added information shows the collapse is not an artifact of the particular contrast selection. revision: yes
- The manuscript contains no human preference data or correlation with human judgments; obtaining such data would require new experiments outside the scope of the present work.
Circularity Check
No significant circularity; validation is empirical inter-judge agreement
full rationale
The paper defines the VLM-judge protocol (24-view rig, two VLM families, position-bias correction) and reports its reliability via measured Cohen's kappa=0.66 between the families plus order-consistency filtering. This agreement is an external empirical observation between distinct models, not a self-definition, fitted parameter renamed as prediction, or reduction by construction. Proxy comparisons are performed relative to this independently measured reference; no equations, self-citations, or ansatzes are shown to force the result. The derivation chain is self-contained against the stated inter-judge benchmark.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Cohen's kappa of 0.66 indicates substantial agreement beyond chance between independent VLM judges
- domain assumption Order-consistent verdicts after querying both presentation orders remove position bias
Forward citations
Cited by 1 Pith paper
-
Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation
A de-biased VLM judge protocol is applied to adapt TRELLIS for single-image furniture 3D generation but yields no improvement over the strong public base across six methods.
Reference graph
Works this paper leans on
-
[1]
Unique3D: High-quality and efficient 3d mesh generation from a single image
Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, and Kaisheng Ma. Unique3D: High-quality and efficient 3d mesh generation from a single image
-
[2]
URLhttps://arxiv.org/abs/2405.20343
-
[3]
Raphael Bensadoun, Tom Monnier, Yanir Kleiman, Filippos Kokkinos, Yawar Siddiqui, Ma- hendra Kariya, Omri Harosh, Roman Shapovalov, Benjamin Graham, Emilien Garreau, Ani- mesh Karnewar, Ang Cao, Idan Azuri, Iurii Makarov, Eric-Tuan Le, Antoine Toisoul, David Novotny, Oran Gafni, Natalia Neverova, and Andrea Vedaldi. Meta 3D Gen. 2024. URL https://arxiv.or...
arXiv 2024
-
[4]
Unifi3D: A study on 3d representations for generation and recon- struction in a common framework
Nina Wiedemann, Sainan Liu, Quentin Leboutet, Katelyn Gao, Benjamin Ummenhofer, Michael Paulitsch, and Kai Yuan. Unifi3D: A study on 3d representations for generation and recon- struction in a common framework. 2025. URLhttps://arxiv.org/abs/2509.02474
arXiv 2025
-
[5]
Realiz3D: 3d generation made photorealistic via domain-aware learning
Ido Sobol, Kihyuk Sohn, Yoav Blum, Egor Zakharov, Max Bluvstein, Andrea Vedaldi, and Or Litany. Realiz3D: 3d generation made photorealistic via domain-aware learning. 2026. URL https://arxiv.org/abs/2605.13852
Pith/arXiv arXiv 2026
-
[6]
MVGBench: Comprehensive benchmark for multi-view generation models
Xianghui Xie, Chuhang Zou, Meher Gitika Karumuri, Jan Eric Lenssen, and Gerard Pons- Moll. MVGBench: Comprehensive benchmark for multi-view generation models. 2025. URL https://arxiv.org/abs/2507.00006
arXiv 2025
-
[7]
Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation
Matteo Bastico, David Ryckelynck, Laurent Corté, Yannick Tillier, and Etienne Decencière. Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation. 2025. URL https://arxiv.org/abs/2511.05308. 8 Judging Single-Image 3D Generation
arXiv 2025
-
[8]
Objaverse++: Curated 3d object dataset with quality annotations
Chendi Lin, Heshan Liu, Qunshu Lin, Zachary Bright, Shitao Tang, Yihui He, Minghao Liu, Ling Zhu, and Cindy Le. Objaverse++: Curated 3d object dataset with quality annotations
-
[9]
URLhttps://arxiv.org/abs/2504.07334
-
[10]
ManiTwin: Scaling data-generation-ready digital object dataset to 100k
Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, and Ping Luo. ManiTwin: Scaling data-generation-ready digital object dataset to 100k. 2026. URLhttps://arxiv.org/abs/2603.16866
arXiv 2026
-
[11]
Memorization in 3D Shape Generation: An Empirical Study
Shu Pu, Boya Zeng, Kaichen Zhou, Mengyu Wang, and Zhuang Liu. Memorization in 3D Shape Generation: An Empirical Study. 2025. URLhttps://arxiv.org/abs/2512.23628
arXiv 2025
-
[12]
ShapeR: Robust conditional 3d shape generation from casual captures
Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, and Jakob Engel. ShapeR: Robust conditional 3d shape generation from casual captures. 2026. URL https://arxiv.org/abs/2601.11514
arXiv 2026
-
[13]
Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang, Shih-Ying Yeh, Ho Kei Ch...
arXiv 2026
-
[14]
MJ1: Multimodal Judgment via Grounded Verification
Bhavesh Kumar, Dylan Feng, and Leonard Tang. MJ1: Multimodal Judgment via Grounded Verification. 2026. URLhttps://arxiv.org/abs/2603.07990
arXiv 2026
-
[15]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. 2023. URLhttps: //arxiv.org/abs/2306.05685
Pith/arXiv arXiv 2023
-
[16]
Large Language Models are not Fair Evaluators
Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large Language Models are not Fair Evaluators. 2023. URL https://arxiv.org/abs/2305.17926
Pith/arXiv arXiv 2023
-
[17]
DSO: Aligning 3d generators with simulation feedback for physical soundness
Ruining Li, Chuanxia Zheng, Christian Rupprecht, and Andrea Vedaldi. DSO: Aligning 3d generators with simulation feedback for physical soundness. 2025. URLhttps://arxiv.org/ abs/2503.22677
arXiv 2025
-
[18]
Nabla-R2D3: Effective and efficient 3d diffusion alignment with 2d rewards
Qingming Liu, Zhen Liu, Dinghuai Zhang, and Kui Jia. Nabla-R2D3: Effective and efficient 3d diffusion alignment with 2d rewards. 2025. URLhttps://arxiv.org/abs/2506.15684
arXiv 2025
-
[19]
DreamDPO: Aligning text-to-3d generation with human preferences via direct preference optimization
Zhenglin Zhou, Xiaobo Xia, Fan Ma, Hehe Fan, Yi Yang, and Tat-Seng Chua. DreamDPO: Aligning text-to-3d generation with human preferences via direct preference optimization. 2025. URLhttps://arxiv.org/abs/2502.04370
arXiv 2025
-
[20]
Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, and Humphrey Shi. MapReduce LoRA: 9 Judging Single-Image 3D Generation Advancing the pareto front in multi-preference optimization for generative models. 2025. URL https://arxiv.org/abs/2511.20629
arXiv 2025
-
[21]
Mahmoud, Mina Konaković Luković, and Justin Solomon
Anh Truong, Ahmed H. Mahmoud, Mina Konaković Luković, and Justin Solomon. Low-Rank Adaptation of Neural Fields. 2025. URLhttps://arxiv.org/abs/2504.15933
arXiv 2025
-
[22]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report. a...
Pith/arXiv arXiv 2025
-
[23]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zh...
Pith/arXiv arXiv 2025
-
[24]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV], 2021. URLhttps://arxiv.org/abs/2103.00020
Pith/arXiv arXiv 2021
-
[25]
doi:10.5281/zenodo.5143773 , url =
Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP. Zenodo, software (version 0.1), 2021. URL https://doi.org/10.5281/zenodo.5143773
-
[26]
Ralph Allan Bradley and Milton E. Terry. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons.Biometrika, 39(3/4):324–345, 1952
1952
-
[27]
Scikit-learn: Machine Learning in Python.Journal of Machine Learning Research, 12:2825–2830, 2011
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python.Journal of Machine Learning Research...
2011
-
[28]
Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google Scanned Objects: A High- Quality Dataset of 3D Scanned Household Items. arXiv:2204.11918 [cs.RO], 2022. URL https://arxiv.org/abs/2204.11918
arXiv 2022
-
[29]
SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement
Mark Boss, Zixuan Huang, Aaryaman Vasishta, and Varun Jampani. SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement. arXiv:2408.00653 [cs.CV], 2024. URLhttps://arxiv.org/abs/2408.00653. 10 Judging Single-Image 3D Generation
arXiv 2024
-
[30]
TripoSR: Fast 3D Object Reconstruction from a Single Image
Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. TripoSR: Fast 3D Object Reconstruction from a Single Image. arXiv:2403.02151 [cs.CV], 2024. URLhttps: //arxiv.org/abs/2403.02151
Pith/arXiv arXiv 2024
-
[31]
Edwin B. Wilson. Probable Inference, the Law of Succession, and Statistical Inference.Journal of the American Statistical Association, 22(158):209–212, 1927
1927
-
[32]
Bootstrap Methods: Another Look at the Jackknife.The Annals of Statistics, 7(1):1–26, 1979
Bradley Efron. Bootstrap Methods: Another Look at the Jackknife.The Annals of Statistics, 7(1):1–26, 1979
1979
-
[33]
A Coefficient of Agreement for Nominal Scales.Educational and Psychological Measurement, 20(1):37–46, 1960
Jacob Cohen. A Coefficient of Agreement for Nominal Scales.Educational and Psychological Measurement, 20(1):37–46, 1960
1960
-
[34]
Richard Landis and Gary G
J. Richard Landis and Gary G. Koch. The Measurement of Observer Agreement for Categorical Data.Biometrics, 33(1):159–174, 1977. 11
1977
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.