pith. sign in

arxiv: 2606.24874 · v1 · pith:KWL4N527new · submitted 2026-06-23 · 💻 cs.CV · cs.AI

FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

Pith reviewed 2026-06-26 00:12 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords 3D Gaussian Splattingimage-to-3D generationdiffusion transformersparse voxel representationcross-modal alignmenthigh-fidelity reconstruction
0
0 comments X

The pith

FLUX3D uses diffusion-aligned structured latents and a sparse multimodal transformer to overcome representation and alignment bottlenecks in image-to-3D Gaussian Splatting generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that two structural issues limit current sparse-voxel 3DGS methods: discriminative 2D features that discard reconstructive detail, and diffusion transformers that fail to align 2D image tokens with 3D voxel latents. It introduces DA-SLAT to select features suited to reconstruction inside a decoder-only pipeline, plus SMDiT and MARoPE to enable geometry-agnostic 2D-3D correspondence during generation. If these changes succeed, generated 3D assets retain high-frequency appearance from the input image at scale. Readers would care because many downstream uses in graphics and simulation depend on visual fidelity rather than coarse geometry alone. Benchmark results are presented as evidence that the combined changes produce measurable gains over prior approaches.

Core claim

FLUX3D addresses the representation bottleneck by replacing standard 2D features with Diffusion-Aligned Structured Latents (DA-SLAT) inside a decoder-only architecture and resolves the cross-modal alignment bottleneck by introducing the Sparse-structure Multimodal Diffusion Transformer (SMDiT) together with Modal-Aware Rotary Positional Embedding (MARoPE), which together enable higher-fidelity 3D Gaussian Splatting from single images.

What carries the argument

Diffusion-Aligned Structured Latents (DA-SLAT) paired with SMDiT and MARoPE, which together select reconstructive 2D features and enforce explicit 2D-3D token correspondence inside the diffusion process.

If this is right

  • Generated 3DGS assets preserve high-frequency texture and edge information from the source photograph.
  • The framework scales to larger sparse voxel grids without the previous loss of reconstructive cues.
  • Cross-modal alignment occurs without requiring explicit geometry supervision during diffusion.
  • Benchmark scores exceed all prior sparse-voxel 3DGS generators on standard fidelity metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment mechanism could be tested on multi-view or video inputs to produce temporally consistent 4D assets.
  • MARoPE-style embeddings might transfer to other sparse-dense modality pairs such as point-cloud-to-image tasks.
  • If the representation change proves robust, it could reduce reliance on large-scale discriminative pretraining for 3D reconstruction pipelines.

Load-bearing premise

The two bottlenecks identified in the paper are the dominant causes of fidelity loss, and the proposed DA-SLAT plus SMDiT/MARoPE components directly eliminate them.

What would settle it

A side-by-side ablation on the same training data showing that removing DA-SLAT or the MARoPE alignment terms produces no measurable drop in PSNR, LPIPS, or visual detail on held-out images.

Figures

Figures reproduced from arXiv: 2606.24874 by Haorui Ji, Hengkai Guo, Hongdong Li, Weizhe Liu.

Figure 1
Figure 1. Figure 1: We present FLUX3D, a scalable image-to-3DGS generation framework based on diffusion-aligned sparse voxel representation that achieves high appearance fidelity. The figure shows, from left to right, the input 2D image, 360-degree renderings of the generated 3D Gaussians, and a visualization of the centers of all Gaussians that compose the 3D assets. Abstract. Sparse voxel representation has emerged as a sca… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of FLUX3D for 3D Gaussian generation. Given the condition image and sparse structure layout, we use SMDiT module to generate diffusion-aligned struc￾tured latents. The obtained latents can then be transformed into target 3D Gaussians by a sparse transformer-based decoder. In this paper, we apply rectified flow to 3D generation and adapt its model ar￾chitecture for sparse voxel data structure to im… view at source ↗
Figure 3
Figure 3. Figure 3: An illustration of our representation learning pipeline. We extend TRELLIS￾SLAT by two modifications: we use FLUX features (as opposed to DINOv2 features) aggregated from multiview renderings to construct diffusion-aligned structured latent space, and use a decoder-only architecture (as opposed to a standard encoder-decoder one) to directly map the structured latent space to 3DGS output. harmonics. During … view at source ↗
Figure 4
Figure 4. Figure 4: An illustration of the SMDiT network architecture. SMDiT employs a transformer architecture comprising both double-stream and single-stream blocks. It incorporates the characteristics of sparse-structured latent space and fa￾cilitates interactions between modalities [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visual comparisons between our method and SOTA 3DGS generation methods, including LGM, DiffusionGS, and TRELLIS. From left to right, the columns show the input 2D image, followed by the 3D reconstructions from each competing method and our approach, rendered from two viewpoints. Our method consistently produces results with high-fidelity appearance modeling across different object categories, such as chara… view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of the influence of input feature selection on sparse voxel rep￾resentation learning. We provide zoomed-in visualizations of local regions to enable granular evaluation of detail appearance quality. We evaluate six feature extractors: FLUX (in both decoder-only and encoder-decoder configurations), Stable Diffusion 2.1 (SD2.1), DINOv3, DINOv2, and vanilla RGB pixels. Our results confirm that d… view at source ↗
Figure 8
Figure 8. Figure 8: Visualization on the influence of each design component during genera￾tion. We disable individual module: the DA-SLAT, the SMDiT architecture, and the MARoPE design, to isolate and un￾derstand their respective impact. to appearance details modeling, while MARoPE alleviates smoothing artifacts, enhancing accurate alignment. In summary, the full configuration achieves the best results on all evaluated benchm… view at source ↗
Figure 1
Figure 1. Figure 1: Examples from ObjaverseXL￾Github. The first and second rows are assets with missing textures, where the rendering engine will automatically ren￾der such areas as pink or white. The third row are assets with normal display [PITH_FULL_IMAGE:figures/full_fig_p020_1.png] view at source ↗
read the original abstract

Sparse voxel representation has emerged as a scalable foundation for image-to-3D Gaussian Splatting (3DGS) generation, yet current methods struggle to preserve high-frequency visual details of input images due to two structural bottlenecks. First, they adopt discriminative 2D features optimized for semantic abstraction to construct sparse voxel latents, which suppress reconstructive cues and induce a representation bottleneck. Second, in the generation stage, standard diffusion transformers lack effective mechanisms to align dense 2D image tokens with sparse 3D voxel latents, resulting in a cross-modal correspondence bottleneck. To address these issues, we propose FLUX3D, a scalable image-to-3DGS framework that boosts both representation learning and cross-modal alignment during generation. We first revisit 2D feature selection for sparse-voxel-based 3D representation learning, propose Diffusion-Aligned Structured Latents (DA-SLAT) and couple it with a decoder-only architecture to improve 3DGS reconstruction fidelity. We also design a sparse-structure-aware diffusion framework, which integrates the Sparse-structure Multimodal Diffusion Transformer (SMDiT) and Modal-Aware Rotary Positional Embedding (MARoPE) to achieve geometry-agnostic 2D-3D alignment. Extensive benchmark experiments demonstrate that FLUX3D yields substantial improvements in appearance fidelity and significantly outperforms all state-of-the-art (SOTA) methods in generating high-quality 3DGS assets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper identifies two bottlenecks in sparse voxel-based image-to-3D Gaussian Splatting generation—use of discriminative 2D features that suppress reconstructive cues, and ineffective 2D-3D alignment in diffusion transformers—and proposes FLUX3D to address them via Diffusion-Aligned Structured Latents (DA-SLAT) paired with a decoder-only architecture, plus a Sparse-structure Multimodal Diffusion Transformer (SMDiT) and Modal-Aware Rotary Positional Embedding (MARoPE) for geometry-agnostic alignment. It claims that these changes yield substantial improvements in appearance fidelity and significantly outperform all state-of-the-art methods on benchmark experiments.

Significance. If the empirical claims hold with proper ablations and controlled comparisons, the work could advance scalable high-fidelity 3DGS asset generation by refining feature selection and cross-modal mechanisms in sparse representations. However, the abstract provides no quantitative metrics, ablation results, or implementation details, so the significance cannot be assessed from the given material.

major comments (1)
  1. Abstract: the central claim that DA-SLAT plus SMDiT/MARoPE directly resolve the two stated bottlenecks and produce substantial fidelity gains is presented without any supporting quantitative evidence, ablation studies, error bars, or references to tables/figures; this is load-bearing for the paper's contribution and cannot be evaluated from the provided text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and the opportunity to clarify the presentation of our contributions. We respond to the major comment below.

read point-by-point responses
  1. Referee: Abstract: the central claim that DA-SLAT plus SMDiT/MARoPE directly resolve the two stated bottlenecks and produce substantial fidelity gains is presented without any supporting quantitative evidence, ablation studies, error bars, or references to tables/figures; this is load-bearing for the paper's contribution and cannot be evaluated from the provided text.

    Authors: We agree that the abstract, being a concise high-level summary, does not embed specific quantitative metrics, ablation details, or direct table/figure references. The full manuscript contains the supporting evidence for these claims, including benchmark results demonstrating fidelity improvements, controlled ablations on the proposed components, and comparisons against prior methods. To address the concern and improve standalone readability of the abstract, we will revise it to incorporate key quantitative highlights and pointers to the relevant experimental sections. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper is a methodological contribution proposing DA-SLAT, SMDiT, and MARoPE to address two stated bottlenecks in image-to-3DGS pipelines. No equations, fitted parameters, or first-principles derivations are present that reduce to inputs by construction. Claims rest on architectural design choices and benchmark comparisons rather than self-referential definitions, renamed empirical patterns, or load-bearing self-citations. The provided abstract and description contain no instances matching any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no technical details, equations, or implementation specifics are provided to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5800 in / 1130 out tokens · 24446 ms · 2026-06-26T00:12:05.162047+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

54 extracted references · 27 canonical work pages · 16 internal anchors

  1. [1]

    1 kontext: Flow matching for in-context image generation and editing in latent space

    Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., En- glish, J., English, Z., Esser, P., Kulal, S., et al.: Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints pp. arXiv–2506 (2025)

  2. [2]

    In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision

    Cai, Y., Zhang, H., Zhang, K., Liang, Y., Ren, M., Luan, F., Liu, Q., Kim, S.Y., Zhang,J.,Zhang,Z.,etal.:Bakinggaussiansplattingintodiffusiondenoiserforfast and scalable single-stage image-to-3D generation and reconstruction. In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision. pp. 25062– 25072 (2025)

  3. [3]

    arXiv preprint arXiv:2507.17745 (2025)

    Chen, Y., Li, Z., Wang, Y., Zhang, H., Li, Q., Zhang, C., Lin, G.: Ultra3D: Efficient and high-fidelity 3D generation with part attention. arXiv preprint arXiv:2507.17745 (2025)

  4. [4]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Chen, Z., Tang, J., Dong, Y., Cao, Z., Hong, F., Lan, Y., Wang, T., Xie, H., Wu, T., Saito, S., et al.: 3Dtopia-xl: Scaling high-quality 3D asset generation via prim- itive diffusion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26576–26586 (2025)

  5. [5]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Collins, J., Goel, S., Deng, K., Luthra, A., Xu, L., Gundogdu, E., Zhang, X., Vicente, T.F.Y., Dideriksen, T., Arora, H., et al.: Abo: Dataset and benchmarks for real-world 3D object understanding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21126–21136 (2022)

  6. [6]

    Advances in Neural Information Processing Systems36, 35799–35813 (2023)

    Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., et al.: Objaverse-xl: A universe of 10m+ 3D objects. Advances in Neural Information Processing Systems36, 35799–35813 (2023)

  7. [7]

    In: Forty-first international conference on machine learning (2024)

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

  8. [8]

    arXiv preprint arXiv:2503.19011 (2025)

    Feng, Y., Yang, M., Yang, S., Zhang, S., Yu, J., Zhao, Z., Liu, Y., Jiang, J., Guo, C.: Romantex: Decoupling 3d-aware rotary positional embedded multi-attention network for texture synthesis. arXiv preprint arXiv:2503.19011 (2025)

  9. [9]

    International Journal of Computer Vision129(12), 3313–3337 (2021)

    Fu, H., Jia, R., Gao, L., Gong, M., Zhao, B., Maybank, S., Tao, D.: 3D-future: 3D furniture shape with texture. International Journal of Computer Vision129(12), 3313–3337 (2021)

  10. [10]

    CAT3D: Create Anything in 3D with Multi-View Diffusion Models

    Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: Cat3D: Create anything in 3D with multi-view diffusion models. arXiv preprint arXiv:2405.10314 (2024)

  11. [11]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Gao, Y., Guo, H., Hoang, T., Huang, W., Jiang, L., Kong, F., Li, H., Li, J., Li, L., Li, X., et al.: Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113 (2025)

  12. [12]

    arXiv preprint arXiv:2403.13788 (2024)

    Gui, M., Fischer, J.S., Prestel, U., Ma, P., Kotovenko, D., Grebenkova, O., Bau- mann, S.A., Hu, V.T., Ommer, B.: Depthfm: Fast monocular depth estimation with flow matching. arXiv preprint arXiv:2403.13788 (2024)

  13. [13]

    Advances in neural information processing systems33, 6840–6851 (2020)

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

  14. [14]

    LRM: Large Reconstruction Model for Single Image to 3D

    Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3D. arXiv preprint arXiv:2311.04400 (2023) 16 H. Ji et al

  15. [15]

    arXiv preprint arXiv:2401.08930 (2024)

    Ji, H., Li, H.: 3d human pose analysis via diffusion synthesis. arXiv preprint arXiv:2401.08930 (2024)

  16. [16]

    arXiv preprint arXiv:2412.20506 (2024)

    Ji, H., Lin, T., Li, H.: Dpbridge: Latent diffusion bridge for dense prediction. arXiv preprint arXiv:2412.20506 (2024)

  17. [17]

    arXiv preprint arXiv:2412.20470 (2024)

    Ji, H., Wang, R., Lin, T., Li, H.: Jade: Joint-aware latent diffusion for 3d human generative modeling. arXiv preprint arXiv:2412.20470 (2024)

  18. [18]

    Shap-E: Generating Conditional 3D Implicit Functions

    Jun, H., Nichol, A.: Shap-e: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463 (2023)

  19. [19]

    In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition

    Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Re- purposing diffusion-based image generators for monocular depth estimation. In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition. pp. 9492–9502 (2024)

  20. [20]

    ACM Trans

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph.42(4), 139–1 (2023)

  21. [21]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Khanna, M., Mao, Y., Jiang, H., Haresh, S., Shacklett, B., Batra, D., Clegg, A., Undersander, E., Chang, A.X., Savva, M.: Habitat synthetic scenes dataset (hssd- 200): An analysis of 3D scene scale and realism tradeoffs for objectgoal navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16384–16393 (2024)

  22. [22]

    Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024)

  23. [23]

    arXiv preprint arXiv:2411.08033 (2024)

    Lan, Y., Zhou, S., Lyu, Z., Hong, F., Yang, S., Dai, B., Pan, X., Loy, C.C.: Gaus- siananything: Interactive point cloud flow matching for 3d object generation. arXiv preprint arXiv:2411.08033 (2024)

  24. [24]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Li, W., Liu, J., Yan, H., Chen, R., Liang, Y., Chen, X., Tan, P., Long, X.: Crafts- man3d: High-fidelity mesh generation with 3d native diffusion and interactive ge- ometry refiner. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5307–5317 (2025)

  25. [25]

    TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

    Li, Y., Zou, Z.X., Liu, Z., Wang, D., Liang, Y., Yu, Z., Liu, X., Guo, Y.C., Liang, D., Ouyang, W., et al.: Triposg: High-fidelity 3D shape synthesis using large-scale rectified flow models. arXiv preprint arXiv:2502.06608 (2025)

  26. [26]

    arXiv preprint arXiv:2505.14521 (2025)

    Li, Z., Wang, Y., Zheng, H., Luo, Y., Wen, B.: Sparc3D: Sparse representa- tion and construction for high-resolution 3D shapes modeling. arXiv preprint arXiv:2505.14521 (2025)

  27. [27]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  28. [28]

    Advances in Neural Information Processing Systems36, 22226–22246 (2023)

    Liu, M., Xu, C., Jin, H., Chen, L., Varma T, M., Xu, Z., Su, H.: One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems36, 22226–22246 (2023)

  29. [29]

    In: Proceedings of the IEEE/CVF inter- national conference on computer vision

    Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero- 1-to-3: Zero-shot one image to 3D object. In: Proceedings of the IEEE/CVF inter- national conference on computer vision. pp. 9298–9309 (2023)

  30. [30]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

  31. [31]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  32. [32]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  33. [33]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023) FLUX3D 17

  34. [34]

    DreamFusion: Text-to-3D using 2D Diffusion

    Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988 (2022)

  35. [35]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Ren, X., Huang, J., Zeng, X., Museth, K., Fidler, S., Williams, F.: Xcube: Large- scale 3D generative modeling using sparse voxel hierarchies. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4209–4219 (2024)

  36. [36]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation, 2025. URL https://arxiv. org/abs/2509.20427

  37. [37]

    ACM Transactions on Graphics (TOG)42(4), 1–16 (2023)

    Shen, T., Munkberg, J., Hasselgren, J., Yin, K., Wang, Z., Chen, W., Gojcic, Z., Fidler, S., Sharp, N., Gao, J.: Flexible isosurface extraction for gradient-based mesh optimization. ACM Transactions on Graphics (TOG)42(4), 1–16 (2023)

  38. [38]

    DINOv3

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

  39. [39]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score- based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)

  40. [40]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Stojanov, S., Thai, A., Rehg, J.M.: Using shape to categorize: Low-shot learn- ing with an explicit shape bias. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1798–1808 (2021)

  41. [41]

    Neurocomputing568, 127063 (2024)

    Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced trans- former with rotary position embedding. Neurocomputing568, 127063 (2024)

  42. [42]

    arXiv preprint arXiv:2411.15098 (2024)

    Tan, Z., Liu, S., Yang, X., Xue, Q., Wang, X.: Ominicontrol: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098 (2024)

  43. [43]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Tan, Z., Liu, S., Yang, X., Xue, Q., Wang, X.: Ominicontrol: Minimal and universal control for diffusion transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14940–14950 (2025)

  44. [44]

    In: European Conference on Computer Vision

    Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: Lgm: Large multi-view gaussian model for high-resolution 3D content creation. In: European Conference on Computer Vision. pp. 1–18. Springer (2024)

  45. [45]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Tang, J., Wang, T., Zhang, B., Zhang, T., Yi, R., Ma, L., Chen, D.: Make-it-3D: High-fidelity 3D creation from a single image with diffusion prior. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 22819–22829 (2023)

  46. [46]

    arXiv preprint arXiv:2503.09277 (2025)

    Wang, H., Peng, J., He, Q., Yang, H., Jin, Y., Wu, J., Hu, X., Pan, Y., Gan, Z., Chi, M., et al.: Unicombine: Unified multi-conditional combination with diffusion transformer. arXiv preprint arXiv:2503.09277 (2025)

  47. [47]

    Qwen-Image Technical Report

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)

  48. [48]

    arXiv preprint arXiv:2505.17412 (2025)

    Wu, S., Lin, Y., Zhang, F., Zeng, Y., Yang, Y., Bao, Y., Qian, J., Zhu, S., Cao, X., Torr, P., et al.: Direct3D-s2: Gigascale 3D generation made easy with spatial sparse attention. arXiv preprint arXiv:2505.17412 (2025)

  49. [49]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Xiang, J., Lv, Z., Xu, S., Deng, Y., Wang, R., Zhang, B., Chen, D., Tong, X., Yang, J.: Structured 3D latents for scalable and versatile 3D generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21469–21480 (2025)

  50. [50]

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    Xu, J., Cheng, W., Gao, Y., Wang, X., Gao, S., Shan, Y.: Instantmesh: Efficient 3D mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191 (2024) 18 H. Ji et al

  51. [51]

    In: European Conference on Computer Vision

    Xu, Y., Shi, Z., Yifan, W., Chen, H., Yang, C., Peng, S., Shen, Y., Wetzstein, G.: Grm: Large gaussian reconstruction model for efficient 3D reconstruction and gen- eration. In: European Conference on Computer Vision. pp. 1–20. Springer (2024)

  52. [52]

    ACM Transactions On Graphics (TOG)42(4), 1–16 (2023)

    Zhang, B., Tang, J., Niessner, M., Wonka, P.: 3dshape2vecset: A 3D shape repre- sentation for neural fields and generative diffusion models. ACM Transactions On Graphics (TOG)42(4), 1–16 (2023)

  53. [53]

    Advances in Neural Information Processing Systems37, 55761–55784 (2024)

    Zhang, C., Song, H., Wei, Y., Yu, C., Lu, J., Tang, Y.: Geolrm: Geometry-aware large reconstruction model for high-quality 3D gaussian generation. Advances in Neural Information Processing Systems37, 55761–55784 (2024)

  54. [54]

    Zhang, L., Wang, Z., Zhang, Q., Qiu, Q., Pang, A., Jiang, H., Yang, W., Xu, L., Yu, J.: Clay: A controllable large-scale generative model for creating high-quality 3D assets. ACM Transactions on Graphics (TOG)43(4), 1–20 (2024) FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation (Supplementary Materials) 1 Additional ...