DB-3DME: From Dataset to Benchmark for Human-aligned Automatic 3D Mesh Evaluation

Jingshen Wang; Nanshan Jia; Sui Huang; Zeyu Zheng; Zhenyu Zhao

arxiv: 2606.10142 · v1 · pith:VR4J7VKDnew · submitted 2026-06-08 · 💻 cs.CV


Pith reviewed 2026-06-27 16:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D mesh evaluation · vision-language models · human alignment · dataset benchmark · fine-tuning · geometry assessment · prompt adherence · synthetic meshes


The pith

Fine-tuning the visual encoder of Qwen-2.5-VL-7B on human-rated 3D meshes produces a model that outperforms pre-trained VLMs in matching human judgments on geometry and prompt adherence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates DB-3DME, a collection of 2,619 synthetic 3D meshes each paired with human ratings on two axes: geometry quality and how well the mesh matches a text prompt. It first tests current vision-language models on this data and finds that how the model encodes 3D visual input determines how closely its scores match people. The authors then adapt only the visual encoder of Qwen-2.5-VL-7B while leaving its language model untouched, producing a new evaluator that scores higher than existing pre-trained models on the same human-aligned dimensions. This matters because 3D generation now moves faster than reliable ways to judge the output, and an automatic metric that tracks human opinion would let researchers iterate without constant human studies. The released dataset and tuned model together form the new reference point for automatic 3D mesh assessment.

Core claim

DB-3DME supplies 2,619 synthetic 3D meshes together with human ratings on Geometry and Prompt Adherence. Systematic benchmarking of state-of-the-art VLMs on this data shows that the visual encoding of 3D representations is the decisive factor for human-aligned evaluation performance. Fine-tuning Qwen-2.5-VL-7B by adapting its visual encoder while freezing the language model yields a model that substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions and thereby establishes a new benchmark for automatic 3D mesh evaluation.

What carries the argument

The DB-3DME dataset of 2,619 synthetic meshes with human ratings on Geometry and Prompt Adherence, used both to diagnose that visual encoding controls alignment and to fine-tune only the visual encoder of Qwen-2.5-VL-7B.

If this is right

Visual encoding of 3D input is the main bottleneck that must be addressed to make VLMs match human 3D-mesh judgments.
Freezing the language model while adapting only the visual encoder is sufficient to obtain large gains on this task.
A public dataset of this size enables direct, reproducible comparison of any future automatic evaluator against human ratings.
The fine-tuned model offers a scalable replacement for human raters when judging large numbers of generated meshes.
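
The encoder-only adaptation in the second point amounts to choosing, per parameter, whether it receives gradient updates. A minimal sketch, assuming parameters are identified by dotted names and that the visual encoder's parameters share a common name prefix. The prefix `visual.` and the example names are hypothetical placeholders for illustration, not taken from the paper or from a specific library:

```python
def split_trainable(param_names, trainable_prefixes=("visual.",)):
    """Partition parameter names into (trainable, frozen) lists by name prefix."""
    prefixes = tuple(trainable_prefixes)
    trainable, frozen = [], []
    for name in param_names:
        # str.startswith accepts a tuple and matches any of its entries.
        if name.startswith(prefixes):
            trainable.append(name)
        else:
            frozen.append(name)
    return trainable, frozen


# Hypothetical parameter names, for illustration only.
example_names = [
    "visual.patch_embed.weight",
    "visual.blocks.0.attn.qkv.weight",
    "language_model.layers.0.mlp.up_proj.weight",
    "lm_head.weight",
]
trainable, frozen = split_trainable(example_names)
```

In a PyTorch-style setup, the same partition would typically be applied by iterating over `model.named_parameters()` and setting `requires_grad` to `False` for every frozen name, so that only the visual encoder is updated.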

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same visual-encoder adaptation pattern could be tested on other 3D representations such as point clouds or neural radiance fields.
Collecting ratings on real-world rather than synthetic meshes would test whether the current ground truth already captures the full distribution of generation artifacts.
If the benchmark is adopted, 3D generation papers could begin reporting automatic scores alongside or instead of small-scale human studies.

Load-bearing premise

The 2,619 synthetic meshes and their human ratings on geometry and prompt adherence supply a representative ground truth that generalizes to other meshes and prompts.

What would settle it

If the fine-tuned model shows no improvement over pre-trained VLMs when scored against human ratings on a new collection of meshes outside the original 2,619, the claim of a new benchmark would not hold.
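
One way to run this check is to score a held-out set of meshes with both evaluators and compare how well each ranking agrees with the human ratings. The sketch below uses Spearman rank correlation as the agreement measure, which is an assumption here (the abstract does not name its metric), and all scores are made up for illustration:

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five held-out meshes (not data from the paper).
human = [1, 2, 3, 4, 5]
pretrained = [3, 1, 2, 5, 4]
finetuned = [1, 2, 3, 5, 4]

rho_pretrained = spearman(human, pretrained)
rho_finetuned = spearman(human, finetuned)
```

On this toy data the correlations come out at about 0.6 for the pre-trained scores and about 0.9 for the fine-tuned ones. The claim would be in trouble if, on real held-out meshes, the second number were not reliably above the first.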

Figures

Figures reproduced from arXiv: 2606.10142 by Jingshen Wang, Nanshan Jia, Sui Huang, Zeyu Zheng, Zhenyu Zhao.

**Figure 1.** Workflow of 3D mesh evaluation. Blue icons denote the released 3D mesh dataset, the green icon represents the evaluation …

**Figure 2.** Prompt distribution of our 3D mesh dataset over object categories.

**Figure 3.** Example sample from our dataset, showing the 3D mesh rendered as a …

Original abstract

Recent advances in 3D generation have led to substantial improvements in realism, controllability, and efficiency, yet the evaluation of 3D assets remains underexplored. Existing evaluation paradigms, including human evaluation, learned metrics, and vision-language models (VLMs) as judges, suffer from limitations in cost, scalability, resolution handling, or task-specific alignment. In this work, we focus on 3D mesh evaluation and introduce DB-3DME, the Dataset and Benchmark for 3D Mesh Evaluation. DB-3DME contains 2,619 synthetic 3D meshes paired with human ratings on Geometry and Prompt Adherence. Using this dataset, we systematically benchmark state-of-the-art VLMs and identify visual encoding of 3D representations as a key factor for human-aligned evaluation performance. Motivated by this finding, we fine-tune an open-weight VLM, Qwen-2.5-VL-7B, for 3D mesh evaluation by adapting the visual encoder while freezing the language model. The fine-tuned model substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions, establishing a new benchmark for automatic 3D mesh evaluation. We publicly release the benchmark dataset on GitHub and Hugging Face to facilitate future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a new dataset of human-rated synthetic 3D meshes and fine-tunes a VLM's visual encoder on it, but the abstract supplies no numbers so the performance claims stay unverified.


This paper gives us a new dataset of 2,619 synthetic 3D meshes with human ratings on geometry and prompt adherence, along with a fine-tuned Qwen-2.5-VL-7B that adapts the visual encoder to better match those ratings. The main takeaway is that visual encoding seems to be the bottleneck for VLM-based 3D mesh evaluation.

They handle the release of the data well and make a reasonable case for why freezing the language model and tuning the vision part is efficient. Spotting the role of visual encoding from their benchmarks is a useful observation for anyone trying to use VLMs on 3D data.

The soft spots stand out because the abstract has zero quantitative results. No scores, no error bars, no protocol for collecting the ratings, so the "substantially outperforms" claim can't be checked. The dataset being only synthetic meshes raises a real question about whether it captures the range of issues in current 3D generators. If the ratings have low agreement or the prompts are biased, the whole benchmark becomes less general.

This is for people in 3D generation and evaluation who need scalable human-aligned metrics. The dataset could help others test their own models. It deserves peer review because the problem is important and the artifacts are new, even though more details on the experiments are needed before it can be fully assessed.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces DB-3DME, a dataset of 2,619 synthetic 3D meshes paired with human ratings on Geometry and Prompt Adherence. It benchmarks state-of-the-art VLMs on this data, identifies visual encoding of 3D representations as a key factor for human-aligned evaluation, fine-tunes Qwen-2.5-VL-7B by adapting the visual encoder while freezing the language model, and claims that the resulting model substantially outperforms existing pre-trained VLMs across multiple dimensions, thereby establishing a new benchmark for automatic 3D mesh evaluation. The dataset is released publicly on GitHub and Hugging Face.

Significance. If the performance claims and generalizability hold, the work would be significant for the 3D generation community by supplying a publicly available human-aligned benchmark and demonstrating a practical fine-tuning approach that improves VLM-based evaluation. The emphasis on visual encoding and the dataset release are concrete contributions that could support reproducible progress in scalable 3D asset assessment.

major comments (2)

[Abstract] The central claim that the fine-tuned Qwen-2.5-VL-7B 'substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions' is presented without any quantitative results, error bars, ablation details, or description of the rating collection protocol. This absence makes the data-to-claim link unverifiable and is load-bearing for the assertion that a new benchmark has been established.
[Abstract] The claim that the 2,619 synthetic meshes and associated human ratings on Geometry and Prompt Adherence constitute a representative ground truth for benchmarking and fine-tuning rests on an unexamined assumption of generalizability. No evidence is supplied regarding coverage of generation artifacts from diverse pipelines, inter-rater agreement statistics, or external validation, which directly affects whether the reported VLM improvements and benchmark status extend beyond this specific collection.

minor comments (1)

The abstract would be strengthened by including at least one key quantitative comparison (e.g., correlation or accuracy delta) to allow readers to assess the performance claim immediately.
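
The error bars requested in the major comments could be attached to such a comparison with a percentile bootstrap over meshes. A minimal sketch, assuming mean absolute error against the human ratings as a stand-in accuracy metric (the paper's actual metric is not stated in the abstract). The function names and the scores are made up for illustration:

```python
import random


def mean_abs_error(pred, target):
    """Mean absolute difference between predicted and target scores."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def bootstrap_mae_gap(model_a, model_b, human, n_resamples=2000, seed=0):
    """Percentile bootstrap for MAE(model_a) - MAE(model_b), resampling meshes.

    Returns (point_estimate, lower, upper) for an approximate 95% interval.
    """
    rng = random.Random(seed)
    n = len(human)
    gaps = []
    for _ in range(n_resamples):
        # Resample mesh indices with replacement, keeping scores paired.
        idx = [rng.randrange(n) for _ in range(n)]
        a = [model_a[i] for i in idx]
        b = [model_b[i] for i in idx]
        h = [human[i] for i in idx]
        gaps.append(mean_abs_error(a, h) - mean_abs_error(b, h))
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    point = mean_abs_error(model_a, human) - mean_abs_error(model_b, human)
    return point, lower, upper


# Made-up scores for eight meshes (not data from the paper).
human = [1, 2, 3, 4, 5, 2, 3, 4]
pretrained = [3, 2, 1, 4, 3, 4, 3, 2]
finetuned = [1, 2, 3, 5, 5, 2, 4, 4]
point, lower, upper = bootstrap_mae_gap(pretrained, finetuned, human)
```

A positive gap means the first evaluator has the larger error. An interval that stays above zero would support reporting the improvement as more than resampling noise.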

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments point-by-point below and will revise the manuscript accordingly where appropriate.


Referee: [Abstract] The central claim that the fine-tuned Qwen-2.5-VL-7B 'substantially outperforms existing pre-trained VLMs across multiple evaluation dimensions' is presented without any quantitative results, error bars, ablation details, or description of the rating collection protocol. This absence makes the data-to-claim link unverifiable and is load-bearing for the assertion that a new benchmark has been established.

Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revised version we will insert key metrics (e.g., Spearman rank correlation improvements with human ratings, standard deviations across folds) and a one-sentence summary of the rating protocol. Full error bars, ablation tables, and protocol details already appear in Sections 4–5; the abstract revision will simply surface the most salient numbers without lengthening the paragraph excessively.
Revision: yes

Referee: [Abstract] The claim that the 2,619 synthetic meshes and associated human ratings on Geometry and Prompt Adherence constitute a representative ground truth for benchmarking and fine-tuning rests on an unexamined assumption of generalizability. No evidence is supplied regarding coverage of generation artifacts from diverse pipelines, inter-rater agreement statistics, or external validation, which directly affects whether the reported VLM improvements and benchmark status extend beyond this specific collection.

Authors: Section 3 describes the mesh generation sources (multiple open-source pipelines) chosen to sample common artifact types; we will add a short clause in the abstract referencing this diversity. Inter-rater agreement statistics (e.g., Fleiss’ kappa) were computed during data collection and will be reported in the revision. External validation on independent test sets lies outside the current scope, as the work introduces the first dedicated benchmark; we view this as a natural direction for follow-up rather than a requirement for establishing the initial benchmark.
Revision: partial
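
For readers unfamiliar with the agreement statistic named in this response, Fleiss' kappa compares the observed agreement among a fixed number of raters with the agreement expected by chance. A minimal sketch of the standard formula, assuming every mesh is rated by the same number of raters and that rating levels are treated as unordered categories. The count tables are made up and do not come from the paper:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Each row must sum to the same total."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items       # mean observed agreement
    p_exp = sum(p * p for p in p_cat)   # agreement expected by chance
    # Undefined (division by zero) when all ratings fall in one category.
    return (p_bar - p_exp) / (1 - p_exp)


# Made-up table: 2 meshes, 3 raters, 2 rating categories.
perfect = [[3, 0], [0, 3]]
mixed = [[2, 1], [1, 2]]
kappa_perfect = fleiss_kappa(perfect)
kappa_mixed = fleiss_kappa(mixed)
```

Because this form ignores the ordering of rating levels, an ordinal scale may be better served by a weighted or ordinal agreement measure; which variant the authors computed is not specified here.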

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent human-rated dataset


The paper introduces a new dataset of 2,619 synthetic meshes with separately collected human ratings on Geometry and Prompt Adherence, then uses that external data to benchmark VLMs, identify visual encoding as a factor, fine-tune Qwen-2.5-VL-7B, and report outperformance. No load-bearing step reduces by construction to a self-definition, a fitted parameter renamed as a prediction, or a self-citation chain; the derivation chain is self-contained against the collected human judgments and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that human ratings collected for the synthetic meshes constitute valid ground truth; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Human ratings on Geometry and Prompt Adherence serve as reliable ground truth for training and evaluating 3D mesh quality models.
The benchmarking and fine-tuning results are measured against these ratings.



[1] [1]

Polydiff: Generating 3d polygonal meshes with diffusion models.arXiv preprint arXiv:2312.11417,

Antonio Alliegro, Yawar Siddiqui, Tatiana Tommasi, and Matthias Nießner. Polydiff: Generating 3d polygonal meshes with diffusion models.arXiv preprint arXiv:2312.11417,

arXiv

[2] [2]

Qwen2.5-vl technical report, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025. 6

2025

[3] [3]

Meta 3d gen, 2024

Raphael Bensadoun, Tom Monnier, Yanir Kleiman, Filippos Kokkinos, Yawar Siddiqui, Mahendra Kariya, Omri Harosh, Roman Shapovalov, Benjamin Graham, Emilien Garreau, Animesh Karnewar, Ang Cao, Idan Azuri, Iurii Makarov, Eric-Tuan Le, Antoine Toisoul, David Novotny, Oran Gafni, Natalia Neverova, and Andrea Vedaldi. Meta 3d gen, 2024. 1

2024

[4] [4]

Gt23d-bench: A comprehensive general text-to-3d gen- eration benchmark.arXiv preprint arXiv:2412.09997, 2024

Xiao Cai, Sitong Su, Jingkuan Song, Pengpeng Zeng, Ji Zhang, Qinhong Du, Mengqi Li, Heng Tao Shen, and Lianli Gao. Gt23d-bench: A comprehensive general text-to-3d gen- eration benchmark.arXiv preprint arXiv:2412.09997, 2024. 3

arXiv 2024

[5] [5]

Fan- tasia3d: Disentangling geometry and appearance for high- quality text-to-3d content creation

Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fan- tasia3d: Disentangling geometry and appearance for high- quality text-to-3d content creation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22246–22256, 2023. 3

2023

[6] [6]

Meshxl: Neural coordinate field for generative 3d foundation models.Advances in Neural Information Pro- cessing Systems, 37:97141–97166, 2024

Sijin Chen, Xin Chen, Anqi Pang, Xianfang Zeng, Wei Cheng, Yijun Fu, Fukun Yin, Billzb Wang, Jingyi Yu, Gang Yu, et al. Meshxl: Neural coordinate field for generative 3d foundation models.Advances in Neural Information Pro- cessing Systems, 37:97141–97166, 2024. 3

2024

[7] [7]

Controllable 3d world genera- tion from any input.https://www.csm.ai/blog/ controllable - 3d - world - generation - from - any-input, 2025

Common Sense Machines. Controllable 3d world genera- tion from any input.https://www.csm.ai/blog/ controllable - 3d - world - generation - from - any-input, 2025. 3

2025

[8] [8]

An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 6

Pith/arXiv arXiv 2010

[9] [9]

Eval3d: Interpretable and fine-grained evaluation for 3d generation

Shivam Duggal, Yushi Hu, Oscar Michel, Aniruddha Kem- bhavi, William T Freeman, Noah A Smith, Ranjay Krishna, Antonio Torralba, Ali Farhadi, and Wei-Chiu Ma. Eval3d: Interpretable and fine-grained evaluation for 3d generation. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 13326–13336, 2025. 3

2025

[10] [10]

3d arena: An open platform for generative 3d evaluation.arXiv preprint arXiv:2506.18787, 2025

Dylan Ebert. 3d arena: An open platform for generative 3d evaluation.arXiv preprint arXiv:2506.18787, 2025. 3

arXiv 2025

[11] [11]

T3 bench: Benchmarking current progress in text-to-3d gener- ation.arXiv preprint arXiv:2310.02977, 2023

Yuze He, Yushi Bai, Matthieu Lin, Wang Zhao, Yubin Hu, Jenny Sheng, Ran Yi, Juanzi Li, and Yong-Jin Liu. T3 bench: Benchmarking current progress in text-to-3d gener- ation.arXiv preprint arXiv:2310.02977, 2023. 3

arXiv 2023

[12] [12]

Lora: Low-rank adaptation of large language models.ICLR,

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.ICLR,

[13] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.

[14] Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, and Chen Change Loy. Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation. In European Conference on Computer Vision, pages 112–130. Springer, 2024.

[15] Kyungmin Lee, Xiahong Li, Qifei Wang, Junfeng He, Junjie Ke, Ming-Hsuan Yang, Irfan Essa, Jinwoo Shin, Feng Yang, and Yinxiao Li. Calibrated multi-preference optimization for aligning diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18465–18475, 2025.

[16] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. arXiv preprint arXiv:2311.06214, 2023.

[17] Ruihang Li, Yixuan Wei, Miaosen Zhang, Nenghai Yu, Han Hu, and Houwen Peng. Scalingfilter: Assessing data quality through inverse utilization of scaling laws. arXiv preprint arXiv:2408.08310, 2024.

[18] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.

[19] Luma Labs. Luma genie: Text-to-3d by luma ai. https://www.luma-ai.com/text-to-3d/, 2024.

[20] Shalini Maiti, Lourdes Agapito, and Filippos Kokkinos. Gen3deval: Using vllms for automatic evaluation of generated 3d objects. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18552–18562, 2025.

[21] Meshy.ai. Meshy: Ai-powered 3d model generation. https://www.meshy.ai/, 2024.

[22] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12663–12673, 2023.

[23] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

[24] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.

[25] Qian Pan, Zahra Ashktorab, Michael Desmond, Martin Santillan Cooper, James Johnson, Rahul Nair, Elizabeth Daly, and Werner Geyer. Human-centered design recommendations for llm-as-a-judge. arXiv preprint arXiv:2407.03479,

[26] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.

[27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[28] Foundation AI Team Roblox. Cube: A roblox view of 3d intelligence. arXiv preprint arXiv:2503.15475, 2025.

[29] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.

[30] Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai, and Matthias Nießner. Meshgpt: Generating triangle meshes with decoder-only transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19615–19625, 2024.

[31] Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett. To cot or not to cot? chain-of-thought helps mainly on math and symbolic reasoning. arXiv preprint arXiv:2409.12183, 2024.

[32] Dipesh Tamboli, Souradip Chakraborty, Aditya Malusare, Biplab Banerjee, Amrit Singh Bedi, and Vaneet Aggarwal. Balanceddpo: Adaptive multi-metric alignment. arXiv preprint arXiv:2503.12575, 2025.

[33] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653,

[34] Tencent Hunyuan3D Team. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation, 2024.

[35] Tencent Hunyuan3D Team. Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation, 2025.

[36] Tencent Hunyuan3D Team. Hunyuan3d 2.5: Towards high-fidelity 3d assets generation with ultimate details, 2025.

[37] Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Clip-nerf: Text-and-image driven manipulation of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3835–3844, 2022.

[38] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in Neural Information Processing Systems, 36:8406–8441, 2023.

[39] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

[40] Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu, Leonidas Guibas, Dahua Lin, and Gordon Wetzstein. Gpt-4v(ision) is a human-aligned evaluator for text-to-3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22227–22238,

[41] Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, and Jing Liao. Text2nerf: Text-driven 3d scene generation with neural radiance fields. IEEE Transactions on Visualization and Computer Graphics, 30(12):7749–7762, 2024.

[42] Yuhan Zhang, Mengchen Zhang, Tong Wu, Tengfei Wang, Gordon Wetzstein, Dahua Lin, and Ziwei Liu. 3dgen-bench: Comprehensive benchmark suite for 3d generative models. arXiv preprint arXiv:2503.21745, 2025.

[43] Yuhan Zhang, Long Zhuo, Ziyang Chu, Tong Wu, Zhibing Li, Liang Pan, Dahua Lin, and Ziwei Liu. Hi3deval: Advancing 3d generation evaluation with hierarchical validity. arXiv preprint arXiv:2508.05609, 2025.