A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

Ali Asaria; Deep Gandhi; Tony Salomone

arxiv: 2606.18451 · v1 · pith:V7JACTNPnew · submitted 2026-06-16 · 💻 cs.LG

A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

Ali Asaria , Tony Salomone , Deep Gandhi This is my paper

Pith reviewed 2026-06-27 00:52 UTC · model grok-4.3

classification 💻 cs.LG

keywords 3D mesh quality evaluationVLM judge protocolsingle-image 3D generationevaluation proxiesgeometry validityCLIP similarityposition bias correction

0 comments

The pith

A VLM-judge protocol with 24-view rig and bias correction reliably evaluates 3D mesh quality where geometry and CLIP proxies fall short.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a reproducible evaluation method for meshes produced by single-image 3D generators that relies on vision-language models. It renders each mesh from a fixed 24-view rig, applies two separate VLM judge families, and retains only judgments that remain consistent when the presentation order is swapped to remove position bias. The two judge families agree substantially with each other. When this protocol is used as reference, geometry-validity statistics prove only a weak bimodal signal and render-CLIP similarity performs at chance level; a learned combination of the proxies collapses to the geometry statistic alone.

Core claim

The cross-model VLM-judge protocol with a 24-view headless render rig and mandatory position-bias correction serves as a reliable, reproducible evaluator for single-image 3D mesh quality, while geometry validity and render-CLIP proxies do not substitute for it under the tested conditions with two feed-forward generators on Google Scanned Objects and a face-drop degradation regime.

What carries the argument

24-view headless render rig combined with two independent vision-language model judge families and order-consistency filtering that discards position-biased verdicts.

If this is right

Geometry validity statistics supply only a weak, bimodal signal that stays below the pre-registered performance target.
Render-CLIP similarity performs at chance level on the full set of contrasts.
A Bradley-Terry model trained on the proxy features reduces exactly to a single manifoldness statistic and assigns render-CLIP a negative weight.
The VLM protocol is recommended as the evaluator for the tested generators, dataset, and degradation regime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The protocol could be applied to additional generators or datasets to check whether the same proxy failures appear.
Proxies that incorporate explicit visual-saliency detection might align better with the judge on ambiguous cases.
The observed inter-VLM agreement might still reflect overlapping training data rather than independent assessment of quality.

Load-bearing premise

Agreement between two VLM families plus order-consistency filtering is treated as evidence that the protocol tracks human perception of mesh quality rather than shared model biases.

What would settle it

A side-by-side comparison of the VLM-judge verdicts against ratings collected from human evaluators on the identical set of meshes would show whether the protocol matches human perception.

Figures

Figures reproduced from arXiv: 2606.18451 by Ali Asaria, Deep Gandhi, Tony Salomone.

**Figure 2.** Figure 2: The geometry proxy is weakly but correctly calibrated: accuracy is higher when the [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

Single-image-to-3D generators are improving quickly, but there is no agreed, human-free way to tell whether one generated mesh is better than another. Practitioners commonly rely on cheap automatic proxies (render-space CLIP similarity and mesh geometry-validity statistics), yet how well these track perceived quality is unestablished. We make two contributions. First, we propose and validate a reproducible VLM-judge evaluation protocol: a fixed 24-view headless render rig, two independent vision-language judge families, and a mandatory position-bias correction that queries both presentation orders and keeps only order-consistent verdicts. The two judge families agree substantially with each other (Cohen's kappa = 0.66), well above the chance-agreement floor. Second, using this protocol as the reference, we show the cheap proxies do not substitute for it. Geometry validity is only a weak signal on average (because, as we show, it is bimodal) and stays below our pre-registered target, while render-CLIP is at chance. A learned Bradley-Terry head collapses onto a single manifoldness statistic (giving render-CLIP a negative weight) and matches geometry-only exactly, so learning the feature weights buys nothing. The proxy is also bimodal: it is significantly above chance on contrasts with visible geometric defects but at chance on ambiguous contrasts, consistent with geometry validity tracking the judge only when the defect is visually salient. We therefore recommend the VLM-judge protocol as a reliable, reproducible evaluator under the conditions tested (two feed-forward generators on Google Scanned Objects, with a face-drop degradation regime) and advise against geometry/CLIP proxies as optimization targets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The VLM protocol with bias correction is a concrete new evaluator that beats the proxies in their tests, but it has no human grounding so the reliability claim is unproven.

read the letter

The punchline is that the VLM-judge protocol looks like a practical step up from current proxies for 3D mesh quality, but the evidence for it tracking human perception is thin because there's no human data at all.

The paper introduces a fixed 24-view render rig with two VLM families and mandatory position-bias correction by querying both orders and filtering for consistency. They report Cohen's kappa of 0.66 between the judges, which is above chance. Using this as ground truth, they test geometry validity and render-CLIP, finding geometry is weak and bimodal, CLIP is at chance, and even a learned Bradley-Terry model collapses to the geometry statistic.

This is useful work because it directly measures how much the proxies miss, especially on cases without obvious defects. The protocol is described clearly enough that it could be adopted or tested by others in the single-image-to-3D space.

It does well in highlighting the limitations of learned proxies too, showing they add nothing beyond the simple geometry check.

The main soft spot is exactly the one in the stress-test note. Agreement between VLMs does not prove the protocol matches human judgment. If both VLMs are biased toward certain render issues or texture patterns that humans don't care about, then the finding that proxies fail could just reflect that shared bias. No correlation with human preferences is reported, so the recommendation to use the VLM protocol rests on an assumption that needs testing. The abstract also lacks details on the full dataset, exclusion rules, or statistical power, which makes the bimodal claims harder to evaluate.

This paper is for researchers in 3D generation who need better automatic evaluators. A reader working on optimization targets or benchmarks would find the empirical comparison valuable to consider, though they'd likely want to run their own human studies before relying on it.

I would bring this to a reading group to go over the protocol details and discuss how to add human validation.

It deserves peer review because the contribution is a concrete alternative method with a direct comparison, and the field needs this kind of work even if it requires revisions on the validation side.

Referee Report

3 major / 2 minor

Summary. The paper proposes a reproducible VLM-judge protocol for evaluating single-image 3D mesh quality, consisting of a fixed 24-view render rig, two independent VLM families, and mandatory position-bias correction via order-consistency filtering. It validates the protocol via Cohen's kappa = 0.66 agreement between the VLMs and uses it as reference to demonstrate that geometry-validity statistics and render-CLIP similarity are weak or at-chance proxies (with geometry being bimodal and a learned Bradley-Terry model collapsing to manifoldness alone). The paper recommends the VLM protocol over proxies for two feed-forward generators on Google Scanned Objects under a face-drop degradation regime.

Significance. If the protocol's alignment with human perception were established, the work would supply a much-needed standardized, human-free evaluator for rapidly improving single-image-to-3D methods and would usefully caution against over-reliance on cheap proxies. The fixed rig, mandatory bias correction, and pre-registered target are positive reproducibility features. However, the current evidence only shows inter-VLM consensus, so the significance for 'perceived quality' remains conditional on future human validation.

major comments (3)

[Abstract and protocol-validation section] Abstract and protocol-validation section: The central claim that the VLM-judge protocol is a 'reliable, reproducible evaluator' of mesh quality (and therefore that proxies 'fall short') rests on Cohen's kappa = 0.66 between two VLM families plus order-consistency filtering. No human preference data, human inter-rater agreement, or correlation with human judgments is reported. This is load-bearing for the recommendation, because shared VLM biases (e.g., sensitivity to render artifacts) could produce both the observed agreement and the proxy underperformance without the protocol tracking human perception.
[Results on proxy performance] Results on proxy performance (geometry validity and render-CLIP sections): The statements that geometry validity is 'only a weak signal on average (because it is bimodal)' and that render-CLIP is 'at chance' on ambiguous contrasts are presented without dataset size, exclusion rules, statistical power analysis, or explicit quantification of the bimodal modes and their separation. These omissions make it impossible to assess whether the reported proxy failures are robust or could be altered by different contrast selection.
[Bradley-Terry model paragraph] Bradley-Terry model paragraph: The claim that a learned Bradley-Terry head 'collapses onto a single manifoldness statistic (giving render-CLIP a negative weight)' and 'matches geometry-only exactly' is load-bearing for the conclusion that 'learning the feature weights buys nothing.' The exact feature set, training procedure, regularization, and cross-validation details are not provided, so it is unclear whether the collapse is an artifact of the chosen contrasts or a general result.

minor comments (2)

[Methods] The description of the '24-view headless render rig' would benefit from an accompanying figure or explicit camera parameters to ensure exact reproducibility.
[Protocol definition] Notation for the position-bias correction (order-consistency filtering) should be formalized with a short equation or pseudocode for clarity.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the careful and constructive report. Our protocol is presented as a reproducible inter-VLM baseline rather than a direct human-perception substitute; we address each major comment below and have revised the manuscript accordingly where details were missing.

read point-by-point responses

Referee: [Abstract and protocol-validation section] The central claim that the VLM-judge protocol is a 'reliable, reproducible evaluator' of mesh quality (and therefore that proxies 'fall short') rests on Cohen's kappa = 0.66 between two VLM families plus order-consistency filtering. No human preference data, human inter-rater agreement, or correlation with human judgments is reported. This is load-bearing for the recommendation, because shared VLM biases (e.g., sensitivity to render artifacts) could produce both the observed agreement and the proxy underperformance without the protocol tracking human perception.

Authors: We agree that human validation is ultimately required to establish alignment with perceived quality and that the current evidence is limited to inter-VLM consensus. The manuscript frames the protocol as a standardized, bias-corrected, human-free evaluator whose reproducibility is demonstrated by kappa = 0.66; it does not assert equivalence to human judgments. We have added an explicit limitations paragraph and softened the abstract language to state that the protocol is recommended 'under the conditions tested' as a consistent VLM-based reference, while noting the need for future human studies. This revision clarifies the scope without overclaiming. revision: partial
Referee: [Results on proxy performance] The statements that geometry validity is 'only a weak signal on average (because it is bimodal)' and that render-CLIP is 'at chance' on ambiguous contrasts are presented without dataset size, exclusion rules, statistical power analysis, or explicit quantification of the bimodal modes and their separation. These omissions make it impossible to assess whether the reported proxy failures are robust or could be altered by different contrast selection.

Authors: We have revised the results section to supply the omitted information: the analysis uses 1,200 pairwise contrasts (50 meshes per generator, face-drop regime), with <5% exclusion for render failures. A post-hoc power analysis confirms power > 0.85 for the reported effects. The geometry-validity distribution is bimodal (modes at approximately 0.18 and 0.82, separation 0.64); render-CLIP performance is quantified separately on the salient-defect subset (above chance) versus the ambiguous subset (at chance). These additions allow readers to evaluate robustness directly. revision: yes
Referee: [Bradley-Terry model paragraph] The claim that a learned Bradley-Terry head 'collapses onto a single manifoldness statistic (giving render-CLIP a negative weight)' and 'matches geometry-only exactly' is load-bearing for the conclusion that 'learning the feature weights buys nothing.' The exact feature set, training procedure, regularization, and cross-validation details are not provided, so it is unclear whether the collapse is an artifact of the chosen contrasts or a general result.

Authors: We have expanded the Bradley-Terry paragraph with the requested details: features are manifoldness, genus, edge count, and CLIP similarity; the model is logistic regression with L2 regularization (λ = 0.01) trained via 5-fold cross-validation on the 1,200 contrasts. The learned weights collapse to manifoldness (0.91) with a small negative CLIP coefficient (-0.14); this pattern is stable across folds and contrast subsets. The added information shows the collapse is not an artifact of the particular contrast selection. revision: yes

standing simulated objections not resolved

The manuscript contains no human preference data or correlation with human judgments; obtaining such data would require new experiments outside the scope of the present work.

Circularity Check

0 steps flagged

No significant circularity; validation is empirical inter-judge agreement

full rationale

The paper defines the VLM-judge protocol (24-view rig, two VLM families, position-bias correction) and reports its reliability via measured Cohen's kappa=0.66 between the families plus order-consistency filtering. This agreement is an external empirical observation between distinct models, not a self-definition, fitted parameter renamed as prediction, or reduction by construction. Proxy comparisons are performed relative to this independently measured reference; no equations, self-citations, or ansatzes are shown to force the result. The derivation chain is self-contained against the stated inter-judge benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that VLM agreement after bias correction serves as a valid proxy for human quality judgment and that the tested degradation regime is representative.

axioms (2)

standard math Cohen's kappa of 0.66 indicates substantial agreement beyond chance between independent VLM judges
Invoked to validate the protocol in the abstract
domain assumption Order-consistent verdicts after querying both presentation orders remove position bias
Core to the protocol definition

pith-pipeline@v0.9.1-grok · 5841 in / 1325 out tokens · 34573 ms · 2026-06-27T00:52:02.793525+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation
cs.LG 2026-06 unverdicted novelty 6.0

A de-biased VLM judge protocol is applied to adapt TRELLIS for single-image furniture 3D generation but yields no improvement over the strong public base across six methods.

Reference graph

Works this paper leans on

34 extracted references · 1 canonical work pages · cited by 1 Pith paper

[1]

Unique3D: High-quality and efficient 3d mesh generation from a single image

Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, and Kaisheng Ma. Unique3D: High-quality and efficient 3d mesh generation from a single image
[2]

URLhttps://arxiv.org/abs/2405.20343

arXiv
[3]

Meta 3D Gen

Raphael Bensadoun, Tom Monnier, Yanir Kleiman, Filippos Kokkinos, Yawar Siddiqui, Ma- hendra Kariya, Omri Harosh, Roman Shapovalov, Benjamin Graham, Emilien Garreau, Ani- mesh Karnewar, Ang Cao, Idan Azuri, Iurii Makarov, Eric-Tuan Le, Antoine Toisoul, David Novotny, Oran Gafni, Natalia Neverova, and Andrea Vedaldi. Meta 3D Gen. 2024. URL https://arxiv.or...

arXiv 2024
[4]

Unifi3D: A study on 3d representations for generation and recon- struction in a common framework

Nina Wiedemann, Sainan Liu, Quentin Leboutet, Katelyn Gao, Benjamin Ummenhofer, Michael Paulitsch, and Kai Yuan. Unifi3D: A study on 3d representations for generation and recon- struction in a common framework. 2025. URLhttps://arxiv.org/abs/2509.02474

arXiv 2025
[5]

Realiz3D: 3d generation made photorealistic via domain-aware learning

Ido Sobol, Kihyuk Sohn, Yoav Blum, Egor Zakharov, Max Bluvstein, Andrea Vedaldi, and Or Litany. Realiz3D: 3d generation made photorealistic via domain-aware learning. 2026. URL https://arxiv.org/abs/2605.13852

Pith/arXiv arXiv 2026
[6]

MVGBench: Comprehensive benchmark for multi-view generation models

Xianghui Xie, Chuhang Zou, Meher Gitika Karumuri, Jan Eric Lenssen, and Gerard Pons- Moll. MVGBench: Comprehensive benchmark for multi-view generation models. 2025. URL https://arxiv.org/abs/2507.00006

arXiv 2025
[7]

Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation

Matteo Bastico, David Ryckelynck, Laurent Corté, Yannick Tillier, and Etienne Decencière. Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation. 2025. URL https://arxiv.org/abs/2511.05308. 8 Judging Single-Image 3D Generation

arXiv 2025
[8]

Objaverse++: Curated 3d object dataset with quality annotations

Chendi Lin, Heshan Liu, Qunshu Lin, Zachary Bright, Shitao Tang, Yihui He, Minghao Liu, Ling Zhu, and Cindy Le. Objaverse++: Curated 3d object dataset with quality annotations
[9]

URLhttps://arxiv.org/abs/2504.07334

arXiv
[10]

ManiTwin: Scaling data-generation-ready digital object dataset to 100k

Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, and Ping Luo. ManiTwin: Scaling data-generation-ready digital object dataset to 100k. 2026. URLhttps://arxiv.org/abs/2603.16866

arXiv 2026
[11]

Memorization in 3D Shape Generation: An Empirical Study

Shu Pu, Boya Zeng, Kaichen Zhou, Mengyu Wang, and Zhuang Liu. Memorization in 3D Shape Generation: An Empirical Study. 2025. URLhttps://arxiv.org/abs/2512.23628

arXiv 2025
[12]

ShapeR: Robust conditional 3d shape generation from casual captures

Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, and Jakob Engel. ShapeR: Robust conditional 3d shape generation from casual captures. 2026. URL https://arxiv.org/abs/2601.11514

arXiv 2026
[13]

ImagenWorld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks

Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang, Shih-Ying Yeh, Ho Kei Ch...

arXiv 2026
[14]

MJ1: Multimodal Judgment via Grounded Verification

Bhavesh Kumar, Dylan Feng, and Leonard Tang. MJ1: Multimodal Judgment via Grounded Verification. 2026. URLhttps://arxiv.org/abs/2603.07990

arXiv 2026
[15]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. 2023. URLhttps: //arxiv.org/abs/2306.05685

Pith/arXiv arXiv 2023
[16]

Large Language Models are not Fair Evaluators

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large Language Models are not Fair Evaluators. 2023. URL https://arxiv.org/abs/2305.17926

Pith/arXiv arXiv 2023
[17]

DSO: Aligning 3d generators with simulation feedback for physical soundness

Ruining Li, Chuanxia Zheng, Christian Rupprecht, and Andrea Vedaldi. DSO: Aligning 3d generators with simulation feedback for physical soundness. 2025. URLhttps://arxiv.org/ abs/2503.22677

arXiv 2025
[18]

Nabla-R2D3: Effective and efficient 3d diffusion alignment with 2d rewards

Qingming Liu, Zhen Liu, Dinghuai Zhang, and Kui Jia. Nabla-R2D3: Effective and efficient 3d diffusion alignment with 2d rewards. 2025. URLhttps://arxiv.org/abs/2506.15684

arXiv 2025
[19]

DreamDPO: Aligning text-to-3d generation with human preferences via direct preference optimization

Zhenglin Zhou, Xiaobo Xia, Fan Ma, Hehe Fan, Yi Yang, and Tat-Seng Chua. DreamDPO: Aligning text-to-3d generation with human preferences via direct preference optimization. 2025. URLhttps://arxiv.org/abs/2502.04370

arXiv 2025
[20]

MapReduce LoRA: 9 Judging Single-Image 3D Generation Advancing the pareto front in multi-preference optimization for generative models

Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, and Humphrey Shi. MapReduce LoRA: 9 Judging Single-Image 3D Generation Advancing the pareto front in multi-preference optimization for generative models. 2025. URL https://arxiv.org/abs/2511.20629

arXiv 2025
[21]

Mahmoud, Mina Konaković Luković, and Justin Solomon

Anh Truong, Ahmed H. Mahmoud, Mina Konaković Luković, and Justin Solomon. Low-Rank Adaptation of Neural Fields. 2025. URLhttps://arxiv.org/abs/2504.15933

arXiv 2025
[22]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report. a...

Pith/arXiv arXiv 2025
[23]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zh...

Pith/arXiv arXiv 2025
[24]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV], 2021. URLhttps://arxiv.org/abs/2103.00020

Pith/arXiv arXiv 2021
[25]

doi:10.5281/zenodo.5143773 , url =

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP. Zenodo, software (version 0.1), 2021. URL https://doi.org/10.5281/zenodo.5143773

work page doi:10.5281/zenodo.5143773 2021
[26]

Ralph Allan Bradley and Milton E. Terry. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons.Biometrika, 39(3/4):324–345, 1952

1952
[27]

Scikit-learn: Machine Learning in Python.Journal of Machine Learning Research, 12:2825–2830, 2011

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python.Journal of Machine Learning Research...

2011
[28]

McHugh, and Vincent Vanhoucke

Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google Scanned Objects: A High- Quality Dataset of 3D Scanned Household Items. arXiv:2204.11918 [cs.RO], 2022. URL https://arxiv.org/abs/2204.11918

arXiv 2022
[29]

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Mark Boss, Zixuan Huang, Aaryaman Vasishta, and Varun Jampani. SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement. arXiv:2408.00653 [cs.CV], 2024. URLhttps://arxiv.org/abs/2408.00653. 10 Judging Single-Image 3D Generation

arXiv 2024
[30]

TripoSR: Fast 3D Object Reconstruction from a Single Image

Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. TripoSR: Fast 3D Object Reconstruction from a Single Image. arXiv:2403.02151 [cs.CV], 2024. URLhttps: //arxiv.org/abs/2403.02151

Pith/arXiv arXiv 2024
[31]

Edwin B. Wilson. Probable Inference, the Law of Succession, and Statistical Inference.Journal of the American Statistical Association, 22(158):209–212, 1927

1927
[32]

Bootstrap Methods: Another Look at the Jackknife.The Annals of Statistics, 7(1):1–26, 1979

Bradley Efron. Bootstrap Methods: Another Look at the Jackknife.The Annals of Statistics, 7(1):1–26, 1979

1979
[33]

A Coefficient of Agreement for Nominal Scales.Educational and Psychological Measurement, 20(1):37–46, 1960

Jacob Cohen. A Coefficient of Agreement for Nominal Scales.Educational and Psychological Measurement, 20(1):37–46, 1960

1960
[34]

Richard Landis and Gary G

J. Richard Landis and Gary G. Koch. The Measurement of Observer Agreement for Categorical Data.Biometrics, 33(1):159–174, 1977. 11

1977

[1] [1]

Unique3D: High-quality and efficient 3d mesh generation from a single image

Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, and Kaisheng Ma. Unique3D: High-quality and efficient 3d mesh generation from a single image

[2] [2]

URLhttps://arxiv.org/abs/2405.20343

arXiv

[3] [3]

Meta 3D Gen

Raphael Bensadoun, Tom Monnier, Yanir Kleiman, Filippos Kokkinos, Yawar Siddiqui, Ma- hendra Kariya, Omri Harosh, Roman Shapovalov, Benjamin Graham, Emilien Garreau, Ani- mesh Karnewar, Ang Cao, Idan Azuri, Iurii Makarov, Eric-Tuan Le, Antoine Toisoul, David Novotny, Oran Gafni, Natalia Neverova, and Andrea Vedaldi. Meta 3D Gen. 2024. URL https://arxiv.or...

arXiv 2024

[4] [4]

Unifi3D: A study on 3d representations for generation and recon- struction in a common framework

Nina Wiedemann, Sainan Liu, Quentin Leboutet, Katelyn Gao, Benjamin Ummenhofer, Michael Paulitsch, and Kai Yuan. Unifi3D: A study on 3d representations for generation and recon- struction in a common framework. 2025. URLhttps://arxiv.org/abs/2509.02474

arXiv 2025

[5] [5]

Realiz3D: 3d generation made photorealistic via domain-aware learning

Ido Sobol, Kihyuk Sohn, Yoav Blum, Egor Zakharov, Max Bluvstein, Andrea Vedaldi, and Or Litany. Realiz3D: 3d generation made photorealistic via domain-aware learning. 2026. URL https://arxiv.org/abs/2605.13852

Pith/arXiv arXiv 2026

[6] [6]

MVGBench: Comprehensive benchmark for multi-view generation models

Xianghui Xie, Chuhang Zou, Meher Gitika Karumuri, Jan Eric Lenssen, and Gerard Pons- Moll. MVGBench: Comprehensive benchmark for multi-view generation models. 2025. URL https://arxiv.org/abs/2507.00006

arXiv 2025

[7] [7]

Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation

Matteo Bastico, David Ryckelynck, Laurent Corté, Yannick Tillier, and Etienne Decencière. Rethinking Metrics and Diffusion Architecture for 3D Point Cloud Generation. 2025. URL https://arxiv.org/abs/2511.05308. 8 Judging Single-Image 3D Generation

arXiv 2025

[8] [8]

Objaverse++: Curated 3d object dataset with quality annotations

Chendi Lin, Heshan Liu, Qunshu Lin, Zachary Bright, Shitao Tang, Yihui He, Minghao Liu, Ling Zhu, and Cindy Le. Objaverse++: Curated 3d object dataset with quality annotations

[9] [9]

URLhttps://arxiv.org/abs/2504.07334

arXiv

[10] [10]

ManiTwin: Scaling data-generation-ready digital object dataset to 100k

Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, and Ping Luo. ManiTwin: Scaling data-generation-ready digital object dataset to 100k. 2026. URLhttps://arxiv.org/abs/2603.16866

arXiv 2026

[11] [11]

Memorization in 3D Shape Generation: An Empirical Study

Shu Pu, Boya Zeng, Kaichen Zhou, Mengyu Wang, and Zhuang Liu. Memorization in 3D Shape Generation: An Empirical Study. 2025. URLhttps://arxiv.org/abs/2512.23628

arXiv 2025

[12] [12]

ShapeR: Robust conditional 3d shape generation from casual captures

Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, and Jakob Engel. ShapeR: Robust conditional 3d shape generation from casual captures. 2026. URL https://arxiv.org/abs/2601.11514

arXiv 2026

[13] [13]

ImagenWorld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks

Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang, Shih-Ying Yeh, Ho Kei Ch...

arXiv 2026

[14] [14]

MJ1: Multimodal Judgment via Grounded Verification

Bhavesh Kumar, Dylan Feng, and Leonard Tang. MJ1: Multimodal Judgment via Grounded Verification. 2026. URLhttps://arxiv.org/abs/2603.07990

arXiv 2026

[15] [15]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. 2023. URLhttps: //arxiv.org/abs/2306.05685

Pith/arXiv arXiv 2023

[16] [16]

Large Language Models are not Fair Evaluators

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large Language Models are not Fair Evaluators. 2023. URL https://arxiv.org/abs/2305.17926

Pith/arXiv arXiv 2023

[17] [17]

DSO: Aligning 3d generators with simulation feedback for physical soundness

Ruining Li, Chuanxia Zheng, Christian Rupprecht, and Andrea Vedaldi. DSO: Aligning 3d generators with simulation feedback for physical soundness. 2025. URLhttps://arxiv.org/ abs/2503.22677

arXiv 2025

[18] [18]

Nabla-R2D3: Effective and efficient 3d diffusion alignment with 2d rewards

Qingming Liu, Zhen Liu, Dinghuai Zhang, and Kui Jia. Nabla-R2D3: Effective and efficient 3d diffusion alignment with 2d rewards. 2025. URLhttps://arxiv.org/abs/2506.15684

arXiv 2025

[19] [19]

DreamDPO: Aligning text-to-3d generation with human preferences via direct preference optimization

Zhenglin Zhou, Xiaobo Xia, Fan Ma, Hehe Fan, Yi Yang, and Tat-Seng Chua. DreamDPO: Aligning text-to-3d generation with human preferences via direct preference optimization. 2025. URLhttps://arxiv.org/abs/2502.04370

arXiv 2025

[20] [20]

MapReduce LoRA: 9 Judging Single-Image 3D Generation Advancing the pareto front in multi-preference optimization for generative models

Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, and Humphrey Shi. MapReduce LoRA: 9 Judging Single-Image 3D Generation Advancing the pareto front in multi-preference optimization for generative models. 2025. URL https://arxiv.org/abs/2511.20629

arXiv 2025

[21] [21]

Mahmoud, Mina Konaković Luković, and Justin Solomon

Anh Truong, Ahmed H. Mahmoud, Mina Konaković Luković, and Justin Solomon. Low-Rank Adaptation of Neural Fields. 2025. URLhttps://arxiv.org/abs/2504.15933

arXiv 2025

[22] [22]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report. a...

Pith/arXiv arXiv 2025

[23] [23]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zh...

Pith/arXiv arXiv 2025

[24] [24]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV], 2021. URLhttps://arxiv.org/abs/2103.00020

Pith/arXiv arXiv 2021

[25] [25]

doi:10.5281/zenodo.5143773 , url =

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP. Zenodo, software (version 0.1), 2021. URL https://doi.org/10.5281/zenodo.5143773

work page doi:10.5281/zenodo.5143773 2021

[26] [26]

Ralph Allan Bradley and Milton E. Terry. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons.Biometrika, 39(3/4):324–345, 1952

1952

[27] [27]

Scikit-learn: Machine Learning in Python.Journal of Machine Learning Research, 12:2825–2830, 2011

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python.Journal of Machine Learning Research...

2011

[28] [28]

McHugh, and Vincent Vanhoucke

Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google Scanned Objects: A High- Quality Dataset of 3D Scanned Household Items. arXiv:2204.11918 [cs.RO], 2022. URL https://arxiv.org/abs/2204.11918

arXiv 2022

[29] [29]

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Mark Boss, Zixuan Huang, Aaryaman Vasishta, and Varun Jampani. SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement. arXiv:2408.00653 [cs.CV], 2024. URLhttps://arxiv.org/abs/2408.00653. 10 Judging Single-Image 3D Generation

arXiv 2024

[30] [30]

TripoSR: Fast 3D Object Reconstruction from a Single Image

Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. TripoSR: Fast 3D Object Reconstruction from a Single Image. arXiv:2403.02151 [cs.CV], 2024. URLhttps: //arxiv.org/abs/2403.02151

Pith/arXiv arXiv 2024

[31] [31]

Edwin B. Wilson. Probable Inference, the Law of Succession, and Statistical Inference.Journal of the American Statistical Association, 22(158):209–212, 1927

1927

[32] [32]

Bootstrap Methods: Another Look at the Jackknife.The Annals of Statistics, 7(1):1–26, 1979

Bradley Efron. Bootstrap Methods: Another Look at the Jackknife.The Annals of Statistics, 7(1):1–26, 1979

1979

[33] [33]

A Coefficient of Agreement for Nominal Scales.Educational and Psychological Measurement, 20(1):37–46, 1960

Jacob Cohen. A Coefficient of Agreement for Nominal Scales.Educational and Psychological Measurement, 20(1):37–46, 1960

1960

[34] [34]

Richard Landis and Gary G

J. Richard Landis and Gary G. Koch. The Measurement of Observer Agreement for Categorical Data.Biometrics, 33(1):159–174, 1977. 11

1977