ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

Chunchao Guo; Hanxiao Sun; Hongbo Fu; Mingxin Yang; Shuhui Yang; Wenhan Luo; Xintong Han; Zebin He

arxiv: 2605.21121 · v1 · pith:SCQMSVC6new · submitted 2026-05-20 · 💻 cs.CV · cs.GR

Hanxiao Sun, Mingxin Yang, Shuhui Yang, Zebin He, Xintong Han, Hongbo Fu, Chunchao Guo, Wenhan Luo

Pith reviewed 2026-05-21 05:00 UTC · model grok-4.3

classification 💻 cs.CV cs.GR

keywords 3D generation · multi-view conditioning · single-view pretrained models · token-wise routing · dual-stream attention · unposed images · arbitrary views · generative models

The pith

A token-wise router and dual-stream attention upgrade pretrained single-view 3D models to accept arbitrary unposed images and raise output quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that pretrained single-view 3D generative models already hold reusable 2D-to-3D grounding that can be extended to multiple inputs. The central insight is that the model's conditioning mechanism mixes orientation control with geometry transfer, so these functions must be separated to avoid conflicts when combining views from different angles. ROAR-3D introduces a lightweight token-wise view router that matches each 3D latent token to its most relevant input image and a dual-stream attention design that keeps the original primary-view behavior while routing auxiliary views through a dedicated path for geometric detail. An orientation perturbation step during training ensures the auxiliary path learns view-independent geometry. The result is higher-fidelity 3D generation that improves as more unposed images are added at test time with almost no extra cost.
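The token-wise routing described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the scoring rule (scaled dot product, max-pooled over each view's image tokens), the absence of learned projections, and the names `route_tokens`, `latent_tokens`, and `view_features` are all assumptions made for illustration.

```python
import numpy as np

def route_tokens(latent_tokens, view_features):
    """Assign each 3D latent token to one input view (illustrative only).

    latent_tokens: (N, D) array of 3D latent tokens.
    view_features: (V, M, D) array with M image tokens for each of V views.
    Returns (assignment, scores): assignment has shape (N,) and holds one
    view index per token; scores has shape (N, V) and holds a softmax
    relevance of every view for every token.
    """
    dim = latent_tokens.shape[1]
    # Similarity of every latent token to every image token: (N, V, M).
    sim = np.einsum("nd,vmd->nvm", latent_tokens, view_features) / np.sqrt(dim)
    # Assumed relevance score: the best-matching image token within each view.
    logits = sim.max(axis=-1)                                  # (N, V)
    logits = logits - logits.max(axis=-1, keepdims=True)       # stability
    scores = np.exp(logits)
    scores /= scores.sum(axis=-1, keepdims=True)
    # Hard routing: each token is assigned to exactly one view.
    assignment = scores.argmax(axis=-1)
    return assignment, scores

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))
views = rng.normal(size=(3, 10, 16))   # 3 unposed views, 10 image tokens each
assignment, scores = route_tokens(tokens, views)
```

Nothing in the function depends on camera poses or on a fixed number of views, which is the property the summary attributes to the router; how the real router scores relevance and is trained is not stated in the text above.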

Core claim

ROAR-3D upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a separate path dedicated to geometric enrichment. An orientation perturbation strategy ensures the auxiliary path learns orientation-independent geometry transfer. These components introduce minimal trainable parameters and add negligible inference overhead relative to the single-view baseline.
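The text above does not say how the orientation perturbation is applied. As a loudly hedged stand-in, the sketch below rotates each auxiliary image in-plane by a random multiple of 90 degrees during training while leaving the primary view untouched, so the auxiliary images no longer carry a reliable orientation cue. The function name `perturb_auxiliary_views`, the convention that view 0 is the primary view, and the choice of in-plane rotation are all assumptions; the paper's actual perturbation may differ.

```python
import numpy as np

def perturb_auxiliary_views(views, rng):
    """Randomly rotate auxiliary views in-plane (hypothetical perturbation).

    views: (V, H, W, C) array of square images; view 0 is treated as the
           primary view and is returned unchanged.
    rng:   a numpy.random.Generator.
    """
    if views.shape[1] != views.shape[2]:
        raise ValueError("this sketch assumes square images (H == W)")
    out = views.copy()
    for v in range(1, views.shape[0]):
        k = int(rng.integers(0, 4))          # 0, 90, 180, or 270 degrees
        out[v] = np.rot90(views[v], k=k, axes=(0, 1))
    return out

rng = np.random.default_rng(0)
batch = rng.random((3, 8, 8, 3))             # 1 primary + 2 auxiliary views
perturbed = perturb_auxiliary_views(batch, rng)
```

The intent mirrored here is only the one stated in the summary: orientation should be determined by the primary view, and the auxiliary path should learn geometry transfer that does not depend on orientation.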

What carries the argument

Token-wise view router paired with dual-stream attention, which assigns each 3D latent token to its most relevant input view and routes auxiliary views through a separate path while preserving primary-view behavior.
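A minimal sketch of how a dual-stream layer of this kind could be wired, under stated assumptions: attention uses identity projections instead of learned weights, the per-token view `assignment` is taken as given from some routing step, and the two streams are combined through a scalar `gate`. The gate and every function name below are hypothetical; the source does not describe how the streams are merged.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Plain scaled dot-product cross-attention without learned projections."""
    scale = np.sqrt(queries.shape[-1])
    return softmax(queries @ keys.T / scale) @ values

def dual_stream_attention(latents, primary, auxiliary, assignment, gate):
    """latents: (N, D); primary: (M, D); auxiliary: (V, M, D);
    assignment: (N,) auxiliary view index per latent token; gate: scalar."""
    # Primary stream: stands in for the unchanged pretrained conditioning path.
    primary_out = cross_attention(latents, primary, primary)
    # Auxiliary stream: each token attends only to the view it was routed to.
    aux_out = np.zeros_like(latents)
    for v in range(auxiliary.shape[0]):
        mask = assignment == v
        if mask.any():
            aux_out[mask] = cross_attention(latents[mask], auxiliary[v], auxiliary[v])
    # With gate == 0 the output reduces to the primary-only result.
    return primary_out + gate * aux_out

rng = np.random.default_rng(1)
z = rng.normal(size=(5, 8))
primary_tokens = rng.normal(size=(7, 8))
aux_tokens = rng.normal(size=(2, 7, 8))
routed = np.array([0, 1, 1, 0, 1])
out = dual_stream_attention(z, primary_tokens, aux_tokens, routed, gate=0.5)
```

The design point the sketch isolates is that the primary path is computed exactly as it would be without auxiliary views, so setting the gate to zero recovers single-view behavior.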

If this is right

Multi-view 3D generation quality reaches state-of-the-art levels compared with prior methods that require fixed views or heavy external modules.
Generation quality improves consistently when the number of input views is scaled from 1 to 12 or more at test time.
Only minimal additional parameters are introduced and inference overhead stays negligible relative to the single-view baseline.
The model operates on arbitrary unposed images without needing explicit camera poses or external reconstruction steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The routing approach could extend to combining inputs from different modalities such as text descriptions or depth maps in the same pretrained backbone.
Test-time view scaling implies that applications with many casual photos of an object could achieve higher fidelity simply by feeding all available images without retraining.
Implicit 2D-to-3D correspondence via token routing may lower the need for explicit pose estimation or multi-view supervision in related reconstruction pipelines.

Load-bearing premise

The orientation control and geometry transfer functions inside a pretrained single-view 3D model can be cleanly separated by routing and dual attention without retraining the core model or losing original performance.

What would settle it

An ablation that adds extra views but removes the token-wise router or dual-stream attention and shows no quality gain or even degradation would falsify the claim that these components enable effective separation and reuse.

Figures

Figures reproduced from arXiv: 2605.21121 by Chunchao Guo, Hanxiao Sun, Hongbo Fu, Mingxin Yang, Shuhui Yang, Wenhan Luo, Xintong Han, Zebin He.

**Figure 1.** Professional-grade 3D assets generated by ROAR-3D from arbitrary collections of concept art, design sketches, and video frames, exhibiting hand-crafted fidelity.

**Figure 2.** An overview illustration of the proposed ROAR-3D framework, which seamlessly integrates supplementary visual cues from arbitrary unposed views with pretrained single-view generative priors via a lightweight token-wise routing mechanism for high-fidelity 3D reconstruction.

**Figure 3.** Qualitative comparison with baseline methods. ROAR-3D produces complete, high-fidelity 3D shapes that are consistent with all input views.

**Figure 4.** Ablation study on the three key components of ROAR-3D. Removing any single component leads to incorrect orientation or degraded geometry (highlighted with red boxes). Best viewed when zoomed in.

**Figure 5.** Effect of varying the number of input views. As more views are provided, the generated geometry becomes more complete and structurally accurate, particularly in regions initially unobserved. The model generalizes robustly beyond the training range (1–4 views) to 8 views.

**Figure 6.** Effect of providing input views at different zoom levels. Close-up crops of specific regions enhance local geometric detail without affecting global shape.

**Figure 7.** Ablation on geometric refinement. Stage 1 produces multi-view-consistent geometry, while stage 2 enhances fine-grained surface detail. Results span diverse input types including objects, architecture, human figures, and line-art sketches.

Original abstract

Single-image-to-3D generative models can now produce high-quality geometry, yet conditioning on a single view inevitably introduces ambiguity about unseen regions. Multi-view conditioning can reduce this ambiguity, but existing methods either require fixed canonical viewpoints or rely on external reconstruction modules that impose heavy training costs and limit generation quality. We observe that pretrained single-view models already possess strong 2D-to-3D grounding that can be reused for multi-view conditioning. However, a closer analysis reveals that their conditioning mechanism entangles orientation control with geometry transfer, two functions that conflict when images from different viewpoints are naively combined. Based on this analysis, we propose ROAR-3D, a lightweight method that upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a separate path dedicated to geometric enrichment. An orientation perturbation strategy ensures the auxiliary path learns orientation-independent geometry transfer. These components introduce minimal trainable parameters and add negligible inference overhead relative to the single-view baseline. ROAR-3D achieves state-of-the-art multi-view 3D generation quality and supports test-time view scaling from 1 to 12+ views with consistent improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ROAR-3D adds a token router and dual-stream attention to let pretrained single-view 3D models handle arbitrary unposed views with low overhead, but the abstract gives no numbers or ablations to show whether it actually delivers better results.

The main thing to know about ROAR-3D is that it proposes a lightweight upgrade to pretrained single-view 3D models so they can take any number of unposed images, using a token-wise router and dual-stream attention. The novelty lies in how it separates functions: the router picks the most relevant view for each 3D latent token without pose information, the dual stream keeps the primary view's original behavior while auxiliary views enrich geometry, and orientation perturbation on the auxiliary views makes their path focus on shape rather than viewing angle. This addresses the conflict the authors observe when views are combined naively. The method adds minimal parameters and supports adding more views at test time for better results.

The paper does well by building on existing strong models instead of starting over or relying on heavy reconstruction add-ons. The idea is practical for cases where the number of input photos varies.

Soft spots include the absence of any tables, ablations, or error analysis in the provided abstract, so the SOTA claim and the consistent improvements are not yet backed by visible data. The central assumption, that the pretrained model's latents allow orientation and geometry to be separated cleanly enough for the router to work, may not always hold, especially with distant or similar views, which could cause routing errors or inconsistencies when scaling to 12 views. Judging from the abstract alone, this concern about how routing behaves under stress appears well founded.

Readers working on efficient 3D generation from casual multi-view inputs would get the most from this if the full results confirm the gains. It could be worth a look for graphics pipelines in AR or robotics. I think the work deserves a serious referee to evaluate the experiments and whether the routing mechanism performs as intended. Recommendation: send it for peer review after checking the full manuscript for solid evidence.

Referee Report

2 major / 1 minor

Summary. The paper introduces ROAR-3D, a lightweight method to upgrade pretrained single-view 3D generative models to handle an arbitrary number of unposed images for high-fidelity 3D generation. It uses a token-wise view router to assign 3D latent tokens to the most relevant view without explicit pose input, a dual-stream attention design to preserve primary-view behavior while routing auxiliary views for geometric enrichment, and an orientation perturbation strategy to ensure the auxiliary path learns orientation-independent geometry transfer. The method claims to achieve state-of-the-art multi-view 3D generation quality, support test-time view scaling from 1 to 12+ views with consistent improvements, and introduce minimal trainable parameters with negligible inference overhead relative to the single-view baseline.

Significance. If the central claims hold under empirical validation, this work would be significant for enabling flexible multi-view conditioning in 3D generation by reusing strong 2D-to-3D grounding from existing pretrained models, avoiding the need for fixed canonical viewpoints or costly external reconstruction modules. The lightweight design and test-time scalability represent practical advances that could improve generation quality for arbitrary input views with low overhead.

major comments (2)

[Abstract and Method Description] The central claim that a token-wise router and dual-stream attention can cleanly separate orientation control from geometry transfer (without retraining the core model) is load-bearing for the entire method and its ability to scale to 12+ views. The analysis of entanglement in the abstract and method description does not demonstrate that the pretrained latent space already encodes view-specific relevance distinctly enough to prevent ambiguous routing decisions or leakage of orientation cues into the geometry stream for distant or similar views.
[Abstract and Experiments] The abstract states that ROAR-3D achieves SOTA multi-view 3D generation quality and consistent improvements with view scaling, but the provided description contains no quantitative tables, ablation studies, or error analysis to support these results. This is load-bearing because the effectiveness of the router, dual-stream attention, and perturbation strategy cannot be assessed without such evidence.

minor comments (1)

[Method] The notation for the token-wise router and dual-stream attention could be clarified with explicit equations or pseudocode to make the separation of primary and auxiliary paths easier to follow.
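As an illustration of the kind of notation this comment asks for, one possible formulation is given below. Every symbol is an assumption introduced here, not taken from the paper: $z_i$ is the $i$-th 3D latent token of dimension $d$, $C_v = \{c_{v,j}\}_j$ are the image tokens of auxiliary view $v$, $C_0$ are the image tokens of the primary view, $r(i)$ is the view routed to token $i$, and $g$ is a hypothetical gate on the auxiliary stream.

```latex
% Illustrative notation only; symbols and the scoring rule are assumptions.
\begin{align}
  s_{i,v} &= \max_{j} \frac{z_i^{\top} c_{v,j}}{\sqrt{d}},
             \qquad v = 1, \dots, V, \\
  r(i)    &= \arg\max_{v} \; s_{i,v}, \\
  o_i     &= \underbrace{\mathrm{Attn}\bigl(z_i, C_0\bigr)}_{\text{primary stream}}
             \; + \; g \cdot
             \underbrace{\mathrm{Attn}\bigl(z_i, C_{r(i)}\bigr)}_{\text{auxiliary stream}} .
\end{align}
```

Under this notation, setting $g = 0$ recovers the primary-only output, which is one way to make explicit what "preserving the pretrained primary-view behavior" would mean.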

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below in a point-by-point fashion and indicate the changes made to strengthen the work.

Referee: [Abstract and Method Description] The central claim that a token-wise router and dual-stream attention can cleanly separate orientation control from geometry transfer (without retraining the core model) is load-bearing for the entire method and its ability to scale to 12+ views. The analysis of entanglement in the abstract and method description does not demonstrate that the pretrained latent space already encodes view-specific relevance distinctly enough to prevent ambiguous routing decisions or leakage of orientation cues into the geometry stream for distant or similar views.

Authors: We agree that a clear demonstration of this separation is essential for the method's validity and scalability. Section 3 of the full manuscript analyzes the entanglement through both theoretical motivation and empirical observations, including attention visualizations showing how naive fusion mixes orientation and geometry signals. The token-wise router computes relevance scores directly from the pretrained latents to assign tokens, while the dual-stream attention and orientation perturbation explicitly isolate orientation control to the primary view. To address the concern about ambiguous routing or leakage for similar or distant views, we have added new quantitative routing accuracy metrics and additional attention visualizations in the revised manuscript and supplementary material. revision: yes
Referee: [Abstract and Experiments] The abstract states that ROAR-3D achieves SOTA multi-view 3D generation quality and consistent improvements with view scaling, but the provided description contains no quantitative tables, ablation studies, or error analysis to support these results. This is load-bearing because the effectiveness of the router, dual-stream attention, and perturbation strategy cannot be assessed without such evidence.

Authors: The full manuscript contains the requested evidence: Table 1 reports quantitative SOTA comparisons on standard 3D generation benchmarks, Table 2 and Section 4.3 present ablations isolating the router, dual-stream attention, and perturbation strategy, and Figure 5 shows consistent quality gains when scaling from 1 to 12+ views at test time. We have expanded the error analysis and failure-case discussion in the revision. While the abstract is necessarily concise, we have updated it to explicitly reference these supporting results and sections. revision: yes
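The rebuttal mentions routing accuracy metrics without defining them. One hypothetical definition, sketched below, is the fraction of latent tokens routed to a view in which the token's 3D location is actually visible. The per-token `visibility` mask (which would have to come from rendered ground-truth data) and the function name `routing_accuracy` are assumptions for illustration, not something the source specifies.

```python
import numpy as np

def routing_accuracy(assignment, visibility):
    """Fraction of scorable tokens routed to a view that sees them.

    assignment: (N,) routed view index per latent token.
    visibility: (N, V) boolean mask, True where the token's 3D location is
                visible in that view (hypothetical ground truth).
    Tokens visible in no view are skipped; returns NaN if none are scorable.
    """
    assignment = np.asarray(assignment)
    visibility = np.asarray(visibility, dtype=bool)
    scorable = visibility.any(axis=1)
    if not scorable.any():
        return float("nan")
    hits = visibility[np.arange(len(assignment)), assignment]
    return float(hits[scorable].mean())

routed_views = [0, 1, 1, 2]
visible = [
    [True, False, False],    # token 0 routed to view 0: hit
    [False, True, False],    # token 1 routed to view 1: hit
    [True, False, False],    # token 2 routed to view 1: miss
    [False, False, False],   # token 3 visible nowhere: skipped
]
accuracy = routing_accuracy(routed_views, visible)   # 2 of 3 scorable tokens
```

A metric of this shape would speak directly to the referee's concern about ambiguous routing for distant or similar views, since it can be reported separately for each such case.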

Circularity Check

0 steps flagged

No circularity in derivation; method builds on external pretrained models

The paper describes an empirical method that reuses pretrained single-view 3D models via added components (token-wise router, dual-stream attention, orientation perturbation). No equations, derivations, or fitted parameters are presented that reduce the claimed multi-view improvements or view-scaling behavior to quantities defined by the method itself. The analysis of entanglement is observational, and performance claims are positioned as empirical outcomes on external benchmarks rather than self-referential predictions. No self-citation chains or uniqueness theorems are invoked as load-bearing. The derivation chain is self-contained against external pretrained models and test-time scaling results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that pretrained single-view models already encode reusable 2D-to-3D correspondences that can be decoupled from orientation control.

axioms (1)

Domain assumption: Pretrained single-view models possess strong 2D-to-3D grounding reusable for multi-view conditioning.
Stated as an observation in the abstract that enables the lightweight upgrade.


work page 2024
[69]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Xue, L., Gao, M., Xing, C., Martín-Martín, R., Wu, J., Xiong, C., Xu, R., Niebles, J.C., Savarese, S.: Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1179–1189 (June 2023)

work page 2023
[70]

In: CVPR

Yang, J., Mao, W., Alvarez, J.M., Liu, M.: Cost volume pyramid based depth inference for multi-view stereo. In: CVPR. pp. 4877–4886 (2020)

work page 2020
[71]

In: ECCV

Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: Mvsnet: Depth inference for unstruc- tured multi-view stereo. In: ECCV. pp. 767–783 (2018)

work page 2018
[72]

In: CVPR

Yao, Y., Luo, Z., Li, S., Shen, T., Fang, T., Quan, L.: Recurrent mvsnet for high- resolution multi-view stereo depth inference. In: CVPR. pp. 5525–5534 (2019)

work page 2019
[73]

Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging.arXiv preprint arXiv:2503.22236, 3:2,

Ye, C., Wu, Y., Lu, Z., Chang, J., Guo, X., Zhou, J., Zhao, H., Han, X.: Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging. arXiv preprint arXiv:2503.222363(2025) ROAR-3D 19

work page arXiv 2025
[74]

TOG42(4), 1–16 (2023)

Zhang, B., Tang, J., Niessner, M., Wonka, P.: 3dshape2vecset: A 3d shape rep- resentation for neural fields and generative diffusion models. TOG42(4), 1–16 (2023)

work page 2023
[75]

arXiv e-prints pp

Zhang, B., Cheng, Y., Yang, J., Wang, C., Zhao, F., Tang, Y., Chen, D., Guo, B.: Gaussiancube: Structuring gaussian splatting using optimal transport for 3d generative modeling. arXiv e-prints pp. arXiv–2403 (2024)

work page 2024
[76]

TOG43(4), 1–20 (2024)

Zhang, L., Wang, Z., Zhang, Q., Qiu, Q., Pang, A., Jiang, H., Yang, W., Xu, L., Yu, J.: Clay: A controllable large-scale generative model for creating high-quality 3d assets. TOG43(4), 1–20 (2024)

work page 2024
[77]

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Zhao, Z., Lai, Z., Lin, Q., Zhao, Y., Liu, H., Yang, S., Feng, Y., Yang, M., Zhang, S., Yang, X., et al.: Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation. arXiv preprint arXiv:2501.12202 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[78]

NeurIPS36(2024)

Zhao, Z., Liu, W., Chen, X., Zeng, X., Wang, R., Cheng, P., Fu, B., Chen, T., Yu, G., Gao, S.: Michelangelo: Conditional 3d shape generation based on shape-image- text aligned latent representation. NeurIPS36(2024)

work page 2024
[79]

In: International Conference on Learning Rep- resentations (ICLR) (2024)

Zhou, J., Wang, J., Ma, B., Liu, Y.S., Huang, T., Wang, X.: Uni3d: Exploring unified 3d representation at scale. In: International Conference on Learning Rep- resentations (ICLR) (2024)

work page 2024
[80]

In: ICCV

Zhou, L., Du, Y., Wu, J.: 3d shape generation and completion through point-voxel diffusion. In: ICCV. pp. 5826–5835 (2021)

work page 2021

Showing first 80 references.

[1] Chang, J., Ye, C., Wu, Y., Chen, Y., Zhang, Y., Luo, Z., Li, C., Zhi, Y., Han, X.: Reconviagen: Towards accurate multi-view 3d object reconstruction via generation (2025), https://arxiv.org/abs/2510.23306

[2] Chen, D.Z., Siddiqui, Y., Lee, H.Y., Tulyakov, S., Nießner, M.: Text2tex: Text-driven texture synthesis via diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18558–18568 (2023)

[3] Chen, H., Gu, J., Chen, A., Tian, W., Tu, Z., Liu, L., Su, H.: Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. In: ICCV. pp. 2416–2425 (2023)

[4] Chen, R., Han, S., Xu, J., Su, H.: Point-based multi-view stereo network. In: ICCV. pp. 1538–1547 (2019)

[5] Chen, Y., Li, Z., Wang, Y., Zhang, H., Li, Q., Zhang, C., Lin, G.: Ultra3d: Efficient and high-fidelity 3d generation with part attention. arXiv preprint arXiv:2507.17745 (2025)

[6] Cheng, S., Xu, Z., Zhu, S., Li, Z., Li, L.E., Ramamoorthi, R., Su, H.: Deep stereo using adaptive thin volume representation with uncertainty awareness. In: CVPR. pp. 2524–2534 (2020)

[7] Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., et al.: Objaverse-xl: A universe of 10m+ 3d objects. NeurIPS 36 (2024)

[8] Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: ICCV. pp. 13142–13153 (2023)

[9] Feng, Y., Yang, M., Yang, S., Zhang, S., Yu, J., Zhao, Z., Liu, Y., Jiang, J., Guo, C.: Romantex: Decoupling 3d-aware rotary positional embedded multi-attention network for texture synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17203–17213 (2025)

[10] Furukawa, Y., Hernández, C., et al.: Multi-view stereo: A tutorial. Foundations and Trends® in Computer Graphics and Vision 9(1-2), 1–148 (2015)

[11] Galliani, S., Lasinger, K., Schindler, K.: Massively parallel multiview stereopsis by surface normal diffusion. In: ICCV. pp. 873–881 (2015)

[12] Gu, X., Fan, Z., Zhu, S., Dai, Z., Tan, F., Tan, P.: Cascade cost volume for high-resolution multi-view stereo and stereo matching. In: CVPR. pp. 2495–2504 (2020)

[13] He, Z., Yang, M., Yang, S., Tang, Y., Wang, T., Zhang, K., Chen, G., Liu, Y., Jiang, J., Guo, C., et al.: Materialmvp: Illumination-invariant material generation via multi-view pbr diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 26294–26305 (2025)

[14] Hitem3D Team: Hitem3d: High-quality 3d model generation service (2024), https://www.hitem3d.ai/, accessed: 2024-05-20

[15] Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d. In: ICLR (2023)

[16] Hui, K.H., Li, R., Hu, J., Fu, C.W.: Neural wavelet-domain diffusion for 3d shape generation. In: SIGGRAPH Asia. pp. 1–9 (2022)

[17] Hunyuan3D, T., Yang, S., Yang, M., Feng, Y., Huang, X., Zhang, S., He, Z., Luo, D., Liu, H., Zhao, Y., Lin, Q., Lai, Z., Yang, X., Shi, H., Zhao, Z., Zhang, B., Yan, H., Wang, L., Liu, S., Zhang, J., Chen, M., Dong, L., Jia, Y., Cai, Y., Yu, J., Tang, Y., Guo, D., Yu, J., Zhang, H., Ye, Z., He, P., Wu, R., Wei, S., Zhang, C., Tan, Y., Sun, Y...

[18] Hy-3D Team: Hy-3d (2024), https://hy-3d.com, accessed: 2024-05-20

[19] Hyper3D Team: Hyper3d: High-fidelity 3d asset generation (2024), https://hyper3d.ai/, accessed: 2024-05-20

[20] Jang, E., Gu, S., Poole, B.: Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016)

[21] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. TOG 42(4), 139–1 (2023)

[22] Lai, Z., Zhao, Y., Liu, H., Zhao, Z., Lin, Q., Shi, H., Yang, X., Yang, M., Yang, S., Feng, Y., et al.: Hunyuan3d 2.5: Towards high-fidelity 3d assets generation with ultimate details. arXiv preprint arXiv:2506.16504 (2025)

[23] Lai, Z., Zhao, Y., Zhao, Z., Liu, H., Lin, Q., Huang, J., Guo, C., Yue, X.: Lattice: Democratize high-fidelity 3d generation at scale. arXiv preprint arXiv:2512.03052 (2025)

[24] Li, J., Tan, H., Zhang, K., Xu, Z., Luan, F., Xu, Y., Hong, Y., Sunkavalli, K., Shakhnarovich, G., Bi, S.: Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. arXiv preprint arXiv:2311.06214 (2023)

[25] Li, P., Liu, Y., Long, X., Zhang, F., Lin, C., Li, M., Qi, X., Zhang, S., Luo, W., Tan, P., et al.: Era3d: High-resolution multiview diffusion using efficient row-wise attention. arXiv preprint arXiv:2405.11616 (2024)

[26] Li, W., Liu, J., Chen, R., Liang, Y., Chen, X., Tan, P., Long, X.: Craftsman: High-fidelity mesh generation with 3d native generation and interactive geometry refiner. arXiv preprint arXiv:2405.14979 (2024)

[27] Li, Y., Zou, Z.X., Liu, Z., Wang, D., Liang, Y., Yu, Z., Liu, X., Guo, Y.C., Liang, D., Ouyang, W., et al.: Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models. arXiv preprint arXiv:2502.06608 (2025)

[28] Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M.Y., Lin, T.Y.: Magic3d: High-resolution text-to-3d content creation. In: CVPR. pp. 300–309 (2023)

[29] Lin, C.H., Ma, W.C., Torralba, A., Lucey, S.: Barf: Bundle-adjusting neural radiance fields. In: ICCV. pp. 5741–5751 (2021)

[30] Liu, M., Shi, R., Chen, L., Zhang, Z., Xu, C., Wei, X., Chen, H., Zeng, C., Gu, J., Su, H.: One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. In: CVPR. pp. 10072–10083 (2024)

[31] Liu, M., Xu, C., Jin, H., Chen, L., Varma T, M., Xu, Z., Su, H.: One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. NeurIPS 36 (2023)

[32] Liu, Y., Xie, M., Liu, H., Wong, T.T.: Text-guided texturing by synchronized multi-view diffusion. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

[33] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

[34] Luo, D., Yang, S., Yang, M., Lu, J., Tang, Y., Han, X., Chen, Z., Wang, B., Guo, C.: Matpedia: A universal generative foundation for high-fidelity material synthesis. arXiv preprint arXiv:2511.16957 (2025)

[35] Luo, S., Hu, W.: Diffusion probabilistic models for 3d point cloud generation. In: CVPR. pp. 2837–2845 (2021)

[36] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: ECCV. pp. 405–421 (2020)

[37] Müller, N., Siddiqui, Y., Porzi, L., Bulo, S.R., Kontschieder, P., Nießner, M.: Diffrf: Rendering-guided 3d radiance field diffusion. In: CVPR. pp. 4328–4338 (2023)

[38] Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., Chen, M.: Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751 (2022)

[39] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal pp. 1–31 (2024)

[40] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022)

[41] Qiu, L., Chen, G., Gu, X., Zuo, Q., Xu, M., Wu, Y., Yuan, W., Dong, Z., Bo, L., Han, X.: Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d. In: CVPR. pp. 9914–9925 (2024)

[42] Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV. pp. 501–518. Springer (2016)

[43] Shue, J.R., Chan, E.R., Po, R., Ankner, Z., Wu, J., Wetzstein, G.: 3d neural field generation using triplane diffusion. In: CVPR. pp. 20875–20886 (2023)

[44] Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In: ECCV. pp. 1–18. Springer (2025)

[45] Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. In: ICLR (2024)

[46] Tang, J., Wang, T., Zhang, B., Zhang, T., Yi, R., Ma, L., Chen, D.: Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. In: ICCV. pp. 22819–22829 (2023)

[47] Tang, Z., Gu, S., Wang, C., Zhang, T., Bao, J., Chen, D., Guo, B.: Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint arXiv:2312.11459 (2023)

[48] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

[49] Tencent Hunyuan: Hunyuan3d. https://3d.hunyuan.tencent.com/ (2024)

[50] Tripo AI: Tripo: Fast 3d object generation from text and image (2024), https://www.tripo3d.ai/, accessed: 2024-05-20

[51] Wang, F., Galliani, S., Vogel, C., Speciale, P., Pollefeys, M.: Patchmatchnet: Learned multi-view patchmatch stereo. In: CVPR. pp. 14194–14203 (2021)

[52] Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. arXiv preprint arXiv:2503.11651 (2025)

[53] Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. In: CVPR. pp. 20697–20709 (2024)

[54] Wang, T., Zhang, B., Zhang, T., Gu, S., Bao, J., Baltrusaitis, T., Shen, J., Chen, D., Wen, F., Chen, Q., et al.: Rodin: A generative model for sculpting 3d digital avatars using diffusion. In: CVPR. pp. 4563–4573 (2023)

[55] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. NeurIPS 36 (2024)

[56] Wang, Z., Wang, Y., Chen, Y., Xiang, C., Chen, S., Yu, D., Li, C., Su, H., Zhu, J.: Crm: Single image to 3d textured mesh with convolutional reconstruction model. In: ECCV. pp. 57–74. Springer (2025)

[57] Wang, Z., Wu, S., Xie, W., Chen, M., Prisacariu, V.A.: Nerf--: Neural radiance fields without known camera parameters (2021)

[58] Wei, X., Zhang, K., Bi, S., Tan, H., Luan, F., Deschaintre, V., Sunkavalli, K., Su, H., Xu, Z.: Meshlrm: Large reconstruction model for high-quality mesh. arXiv preprint arXiv:2404.12385 (2024)

[59] Wu, C.H., Chen, Y.C., Solarte, B., Yuan, L., Sun, M.: ifusion: Inverting diffusion for pose-free reconstruction from sparse views. arXiv preprint arXiv:2312.17250 (2023)

[60] Wu, K., Liu, F., Cai, Z., Yan, R., Wang, H., Hu, Y., Duan, Y., Ma, K.: Unique3d: High-quality and efficient 3d mesh generation from a single image. arXiv preprint arXiv:2405.20343 (2024)

[61] Wu, S., Lin, Y., Zhang, F., Zeng, Y., Xu, J., Torr, P., Cao, X., Yao, Y.: Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer. arXiv preprint arXiv:2405.14832 (2024)

[62] Xiang, J., Chen, X., Xu, S., Wang, R., Lv, Z., Deng, Y., Zhu, H., Dong, Y., Zhao, H., Yuan, N.J., Yang, J.: Native and compact structured latents for 3d generation (2025), https://arxiv.org/abs/2512.14692

[63] Xiang, J., Lv, Z., Xu, S., Deng, Y., Wang, R., Zhang, B., Chen, D., Tong, X., Yang, J.: Structured 3d latents for scalable and versatile 3d generation. arXiv preprint arXiv:2412.01506 (2024)

[64] Xu, C., Li, A., Chen, L., Liu, Y., Shi, R., Su, H., Liu, M.: Sparp: Fast 3d object reconstruction and pose estimation from sparse views. In: ECCV. pp. 143–163. Springer (2024)

[65] Xu, J., Cheng, W., Gao, Y., Wang, X., Gao, S., Shan, Y.: Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191 (2024)

[66] Xu, Q., Tao, W.: Multi-scale geometric consistency guided multi-view stereo. In: CVPR. pp. 5483–5492 (2019)

[67] Xu, Y., Shi, Z., Yifan, W., Chen, H., Yang, C., Peng, S., Shen, Y., Wetzstein, G.: Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation. arXiv preprint arXiv:2403.14621 (2024)

[68] Xu, Y., Tan, H., Luan, F., Bi, S., Wang, P., Li, J., Shi, Z., Sunkavalli, K., Wetzstein, G., Xu, Z., et al.: Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. In: ICLR (2024)

[69] Xue, L., Gao, M., Xing, C., Martín-Martín, R., Wu, J., Xiong, C., Xu, R., Niebles, J.C., Savarese, S.: Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1179–1189 (June 2023)

[70] Yang, J., Mao, W., Alvarez, J.M., Liu, M.: Cost volume pyramid based depth inference for multi-view stereo. In: CVPR. pp. 4877–4886 (2020)

[71] Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: Mvsnet: Depth inference for unstructured multi-view stereo. In: ECCV. pp. 767–783 (2018)

[72] Yao, Y., Luo, Z., Li, S., Shen, T., Fang, T., Quan, L.: Recurrent mvsnet for high-resolution multi-view stereo depth inference. In: CVPR. pp. 5525–5534 (2019)

[73] Ye, C., Wu, Y., Lu, Z., Chang, J., Guo, X., Zhou, J., Zhao, H., Han, X.: Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging. arXiv preprint arXiv:2503.22236 (2025)

[74] Zhang, B., Tang, J., Niessner, M., Wonka, P.: 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. TOG 42(4), 1–16 (2023)

[75] Zhang, B., Cheng, Y., Yang, J., Wang, C., Zhao, F., Tang, Y., Chen, D., Guo, B.: Gaussiancube: Structuring gaussian splatting using optimal transport for 3d generative modeling. arXiv e-prints pp. arXiv–2403 (2024)

[76] Zhang, L., Wang, Z., Zhang, Q., Qiu, Q., Pang, A., Jiang, H., Yang, W., Xu, L., Yu, J.: Clay: A controllable large-scale generative model for creating high-quality 3d assets. TOG 43(4), 1–20 (2024)

[77] Zhao, Z., Lai, Z., Lin, Q., Zhao, Y., Liu, H., Yang, S., Feng, Y., Yang, M., Zhang, S., Yang, X., et al.: Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation. arXiv preprint arXiv:2501.12202 (2025)

[78] Zhao, Z., Liu, W., Chen, X., Zeng, X., Wang, R., Cheng, P., Fu, B., Chen, T., Yu, G., Gao, S.: Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. NeurIPS 36 (2024)

[79] Zhou, J., Wang, J., Ma, B., Liu, Y.S., Huang, T., Wang, X.: Uni3d: Exploring unified 3d representation at scale. In: International Conference on Learning Representations (ICLR) (2024)

[80] Zhou, L., Du, Y., Wu, J.: 3d shape generation and completion through point-voxel diffusion. In: ICCV. pp. 5826–5835 (2021)