DB-3DME: From Dataset to Benchmark for Human-aligned Automatic 3D Mesh Evaluation
Pith reviewed 2026-06-27 16:44 UTC · model grok-4.3
The pith
Fine-tuning the visual encoder of Qwen-2.5-VL-7B on human-rated 3D meshes produces a model that outperforms pre-trained VLMs in matching human judgments on geometry and prompt adherence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DB-3DME supplies 2,619 synthetic 3D meshes together with human ratings on Geometry and Prompt Adherence. Systematic benchmarking of state-of-the-art VLMs on this data shows that the visual encoding of 3D representations is the decisive factor for human-aligned evaluation performance. Fine-tuning Qwen-2.5-VL-7B by adapting its visual encoder while freezing the language model yields a model that substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions and thereby establishes a new benchmark for automatic 3D mesh evaluation.
What carries the argument
The DB-3DME dataset of 2,619 synthetic meshes with human ratings on Geometry and Prompt Adherence, used both to diagnose that visual encoding controls alignment and to fine-tune only the visual encoder of Qwen-2.5-VL-7B.
If this is right
- Visual encoding of 3D input is the main bottleneck that must be addressed to make VLMs match human 3D-mesh judgments.
- Freezing the language model while adapting only the visual encoder is sufficient to obtain large gains on this task.
- A public dataset of this size enables direct, reproducible comparison of any future automatic evaluator against human ratings.
- The fine-tuned model offers a scalable replacement for human raters when judging large numbers of generated meshes.
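The training setup implied by the second point (language model frozen, visual encoder trainable) can be sketched without any deep-learning dependency. This is an illustrative sketch only: the `Param` class stands in for a framework parameter object (in PyTorch the equivalent flag is `requires_grad`), and the parameter names and the `visual.` prefix are assumptions made for the example, not names taken from the paper or from the Qwen-2.5-VL codebase.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical stand-in for a framework parameter (name + trainable flag)."""
    name: str
    requires_grad: bool = True


def freeze_all_but_visual(params, visual_prefixes=("visual.",)):
    """Leave only parameters under a visual-encoder prefix trainable.

    The prefix is an assumption; inspect the real model's parameter names
    before relying on it. Returns the names that remain trainable.
    """
    prefixes = tuple(visual_prefixes)
    trainable = []
    for p in params:
        p.requires_grad = p.name.startswith(prefixes)
        if p.requires_grad:
            trainable.append(p.name)
    return trainable


# Invented parameter names, for illustration only.
params = [
    Param("visual.blocks.0.attn.qkv.weight"),
    Param("visual.merger.mlp.0.weight"),
    Param("language_model.layers.0.self_attn.q_proj.weight"),
    Param("lm_head.weight"),
]
trainable = freeze_all_but_visual(params)
print(trainable)
```

Selecting by name prefix keeps the rule explicit and easy to audit: printing the returned list before training shows exactly which weights will receive gradient updates.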
Where Pith is reading between the lines
- The same visual-encoder adaptation pattern could be tested on other 3D representations such as point clouds or neural radiance fields.
- Collecting ratings on real-world rather than synthetic meshes would test whether the current ground truth already captures the full distribution of generation artifacts.
- If the benchmark is adopted, 3D generation papers could begin reporting automatic scores alongside or instead of small-scale human studies.
Load-bearing premise
The 2,619 synthetic meshes and their human ratings on geometry and prompt adherence supply a representative ground truth that generalizes to other meshes and prompts.
What would settle it
If the fine-tuned model shows no improvement over pre-trained VLMs when scored against human ratings on a new collection of meshes outside the original 2,619, the claim of a new benchmark would not hold.
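One way to run such a check is to compute a rank correlation between each evaluator's scores and the human ratings on the held-out meshes, then compare the two values. The sketch below implements Spearman's rank correlation with average ranks for ties. The choice of Spearman as the agreement measure is an assumption for this example, and the score lists are invented placeholders, not results from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores for five held-out meshes (invented for illustration).
human = [1, 2, 3, 4, 5]
finetuned = [1.2, 2.1, 2.9, 4.5, 4.8]
pretrained = [3, 1, 4, 2, 5]

rho_finetuned = spearman(human, finetuned)
rho_pretrained = spearman(human, pretrained)
print(rho_finetuned, rho_pretrained)
```

A real comparison would also need an uncertainty estimate (for example a bootstrap over meshes) before concluding that one evaluator tracks human ratings better than the other.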
Original abstract
Recent advances in 3D generation have led to substantial improvements in realism, controllability, and efficiency, yet the evaluation of 3D assets remains underexplored. Existing evaluation paradigms, including human evaluation, learned metrics, and vision-language models (VLMs) as judges, suffer from limitations in cost, scalability, resolution handling, or task-specific alignment. In this work, we focus on 3D mesh evaluation and introduce DB-3DME, the Dataset and Benchmark for 3D Mesh Evaluation. DB-3DME contains 2,619 synthetic 3D meshes paired with human ratings on Geometry and Prompt Adherence. Using this dataset, we systematically benchmark state-of-the-art VLMs and identify visual encoding of 3D representations as a key factor for human-aligned evaluation performance. Motivated by this finding, we fine-tune an open-weight VLM, Qwen-2.5-VL-7B, for 3D mesh evaluation by adapting the visual encoder while freezing the language model. The fine-tuned model substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions, establishing a new benchmark for automatic 3D mesh evaluation. We publicly release the benchmark dataset on GitHub and Hugging Face to facilitate future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DB-3DME, a dataset of 2,619 synthetic 3D meshes paired with human ratings on Geometry and Prompt Adherence. It benchmarks state-of-the-art VLMs on this data, identifies visual encoding of 3D representations as a key factor for human-aligned evaluation, fine-tunes Qwen-2.5-VL-7B by adapting the visual encoder while freezing the language model, and claims that the resulting model substantially outperforms existing pre-trained VLMs across multiple dimensions, thereby establishing a new benchmark for automatic 3D mesh evaluation. The dataset is released publicly on GitHub and Hugging Face.
Significance. If the performance claims and generalizability hold, the work would be significant for the 3D generation community by supplying a publicly available human-aligned benchmark and demonstrating a practical fine-tuning approach that improves VLM-based evaluation. The emphasis on visual encoding and the dataset release are concrete contributions that could support reproducible progress in scalable 3D asset assessment.
major comments (2)
- [Abstract] The central claim that the fine-tuned Qwen-2.5-VL-7B 'substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions' is presented without any quantitative results, error bars, ablation details, or description of the rating collection protocol. This absence makes the data-to-claim link unverifiable and is load-bearing for the assertion that a new benchmark has been established.
- [Abstract] The claim that the 2,619 synthetic meshes and associated human ratings on Geometry and Prompt Adherence constitute a representative ground truth for benchmarking and fine-tuning rests on an unexamined assumption of generalizability. No evidence is supplied regarding coverage of generation artifacts from diverse pipelines, inter-rater agreement statistics, or external validation, which directly affects whether the reported VLM improvements and benchmark status extend beyond this specific collection.
minor comments (1)
- The abstract would be strengthened by including at least one key quantitative comparison (e.g., correlation or accuracy delta) to allow readers to assess the performance claim immediately.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address the two major comments point-by-point below and will revise the manuscript accordingly where appropriate.
- Referee: [Abstract] The central claim that the fine-tuned Qwen-2.5-VL-7B 'substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions' is presented without any quantitative results, error bars, ablation details, or description of the rating collection protocol. This absence makes the data-to-claim link unverifiable and is load-bearing for the assertion that a new benchmark has been established.
  Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revised version we will insert key metrics (e.g., Spearman rank correlation improvements with human ratings, standard deviations across folds) and a one-sentence summary of the rating protocol. Full error bars, ablation tables, and protocol details already appear in Sections 4–5; the abstract revision will simply surface the most salient numbers without lengthening the paragraph excessively.
  Revision: yes
- Referee: [Abstract] The claim that the 2,619 synthetic meshes and associated human ratings on Geometry and Prompt Adherence constitute a representative ground truth for benchmarking and fine-tuning rests on an unexamined assumption of generalizability. No evidence is supplied regarding coverage of generation artifacts from diverse pipelines, inter-rater agreement statistics, or external validation, which directly affects whether the reported VLM improvements and benchmark status extend beyond this specific collection.
  Authors: Section 3 describes the mesh generation sources (multiple open-source pipelines) chosen to sample common artifact types; we will add a short clause in the abstract referencing this diversity. Inter-rater agreement statistics (e.g., Fleiss’ kappa) were computed during data collection and will be reported in the revision. External validation on independent test sets lies outside the current scope, as the work introduces the first dedicated benchmark; we view this as a natural direction for follow-up rather than a requirement for establishing the initial benchmark.
  Revision: partial
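For readers unfamiliar with the inter-rater statistic named in the rebuttal, the following is a minimal sketch of Fleiss' kappa. It assumes every mesh is rated by the same number of raters and that ratings fall into a fixed set of categories. The three-level scale and the count tables are hypothetical, since the paper's rating protocol is not described here. Fleiss' kappa also treats categories as nominal, so it ignores the ordering of a rating scale.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Each row must sum to the same total."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of raters")
    n_categories = len(counts[0])

    # Share of all assignments that fall in each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among rater pairs, per item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # agreement expected by chance
    if p_e == 1:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical table: 4 meshes, 3 raters, categories (low, medium, high).
ratings = [
    [3, 0, 0],
    [0, 3, 0],
    [0, 1, 2],
    [1, 2, 0],
]
kappa = fleiss_kappa(ratings)
print(round(kappa, 3))
```

Values near 1 indicate agreement well above chance, values near 0 indicate chance-level agreement, and negative values indicate less agreement than chance would produce.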
Circularity Check
No significant circularity; claims rest on independent human-rated dataset
The paper introduces a new dataset of 2,619 synthetic meshes with separately collected human ratings on Geometry and Prompt Adherence, then uses that external data to benchmark VLMs, identify visual encoding as a factor, fine-tune Qwen-2.5-VL-7B, and report outperformance. No load-bearing step reduces by construction to a self-definition, a fitted parameter renamed as a prediction, or a self-citation chain; the derivation chain is self-contained against the collected human judgments and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human ratings on Geometry and Prompt Adherence serve as reliable ground truth for training and evaluating 3D mesh quality models.