CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models

Chi Wang; Donglin Huang; Guangkai Xu; Hao Chen; Junming Huang; Letian Li; Qiang Dai; Weiwei Xu

arxiv: 2601.21798 · v2 · pith:UNAPRGEWnew · submitted 2026-01-29 · 💻 cs.CV

Pith reviewed 2026-05-21 14:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords: 3D content generation · multi-modal large language models · 3D captioning · high-resolution 3D · Mixture-of-Transformer · 3D VAE

The pith

CG-MLLM creates high-resolution 3D objects and captions them inside a single multi-modal LLM by decoupling token and block modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models handle text and images well but typically produce only coarse or low-resolution 3D shapes. The paper introduces CG-MLLM to perform both 3D captioning and detailed 3D generation in one framework. It splits responsibilities between a TokenAR transformer for individual tokens and a BlockAR transformer for larger spatial blocks. A pre-trained vision-language backbone connects to a 3D VAE latent space so the model can manage long sequences while keeping geometric detail. Training on generation also improves the model's ability to understand 3D structure from ordinary images.

Core claim

CG-MLLM is a multi-modal large language model that performs 3D captioning and high-resolution 3D generation together. It uses a Mixture-of-Transformer architecture in which the Token-level Autoregressive Transformer processes token-level content and the Block-level Autoregressive Transformer processes block-level content. Integration of a pre-trained vision-language backbone with a specialized 3D VAE latent space supports long-context interactions between standard tokens and spatial blocks without loss of resolution or coherence.
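
One way to picture the decoupling is a layer in which every position shares a context step and is then transformed by parameters chosen according to its type. The sketch below is a toy illustration in plain Python, not the paper's implementation. The `Item` type, the two expert functions, and the running-mean stand-in for shared causal attention are all assumptions made for illustration; the shared-attention layout follows the general Mixture-of-Transformers idea in [51], and this page does not confirm that CG-MLLM uses it in this form.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Item:
    kind: str     # "token" (text or vision token) or "block" (3D latent block); labels are illustrative
    value: float  # scalar stand-in for a hidden state


def token_expert(x: float) -> float:
    # Hypothetical stand-in for the TokenAR parameters.
    return 2.0 * x


def block_expert(x: float) -> float:
    # Hypothetical stand-in for the BlockAR parameters.
    return x + 100.0


EXPERTS: Dict[str, Callable[[float], float]] = {
    "token": token_expert,
    "block": block_expert,
}


def mixture_layer(seq: List[Item]) -> List[Item]:
    """Apply a shared causal context step, then a type-specific expert."""
    out: List[Item] = []
    running = 0.0
    for i, item in enumerate(seq):
        # Shared step: causal running mean over the whole mixed sequence
        # (a placeholder for shared self-attention, assumed here).
        running += item.value
        context = running / (i + 1)
        # Type-specific step: route to the expert matching the position's kind.
        out.append(Item(item.kind, EXPERTS[item.kind](context)))
    return out
```

The point of the toy is only the routing: both kinds of position read the same mixed context, but each kind is updated by its own set of parameters.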

What carries the argument

Mixture-of-Transformer architecture with TokenAR and BlockAR components that separate token-level and block-level autoregressive modeling while linking a vision-language backbone to 3D VAE latent space.
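
The difference between token-level and block-level autoregression can be pictured as an attention mask. The following is a minimal sketch under stated assumptions: each standard token is its own autoregressive unit, consecutive block positions are grouped into units of a hypothetical `block_size`, and positions may attend to their own unit and to earlier units. Whether positions inside one block see each other in CG-MLLM is not stated on this page, so that choice is an assumption, as are all function names.

```python
from typing import List


def unit_ids(kinds: List[str], block_size: int) -> List[int]:
    """Assign each position to an autoregressive unit.

    A "token" position is its own unit. Consecutive "block" positions are
    grouped into units of `block_size` (hypothetical grouping).
    """
    if block_size < 1:
        raise ValueError("block_size must be positive")
    ids: List[int] = []
    unit = -1
    run = 0  # how many block positions the current block unit already holds
    for kind in kinds:
        if kind == "token" or run == 0:
            unit += 1  # start a new unit
        ids.append(unit)
        run = (run + 1) % block_size if kind == "block" else 0
    return ids


def block_causal_mask(ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    n = len(ids)
    return [[ids[j] <= ids[i] for j in range(n)] for i in range(n)]


kinds = ["token", "token", "block", "block", "block", "block"]
mask = block_causal_mask(unit_ids(kinds, block_size=2))
```

With this layout the two text tokens are masked one step at a time, while each pair of block positions forms a single step, which is what allows a whole block to be predicted in parallel.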

If this is right

High-resolution 3D content creation enters the standard multi-modal LLM workflow.
A single model handles both describing existing 3D scenes and producing new ones.
Training for 3D generation strengthens the model's perception of 3D structure from 2D images.
Existing MLLMs can be extended to output detailed 3D objects rather than coarse proxies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same split between token and block modeling could be tested on video or point-cloud generation.
Natural-language 3D design tools might become practical by routing text and image inputs through this architecture.
Bidirectional gains between generation and perception may appear in other multimodal settings such as audio-visual models.

Load-bearing premise

Combining a pre-trained vision-language backbone with a 3D VAE latent space inside the Mixture-of-Transformer will preserve fine-grained geometry and long-context coherence.

What would settle it

Compare the geometric fidelity of 3D meshes generated by CG-MLLM against ground-truth high-resolution objects on complex prompts; check whether image-based 3D understanding accuracy rises after the generation training stage.
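
A geometric-fidelity comparison of this kind is often run on points sampled from the generated and ground-truth surfaces. The sketch below computes a symmetric Chamfer distance in plain Python. It is offered as one common choice, not as the metric the paper reports, which this page does not specify; the function names and the squared-distance convention are assumptions.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _mean_nearest_sq(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean squared distance from each point in src to its nearest point in dst."""
    total = 0.0
    for p in src:
        total += min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst)
    return total / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance (squared-distance variant) between two point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return _mean_nearest_sq(a, b) + _mean_nearest_sq(b, a)


generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(generated, reference)
```

The brute-force nearest-neighbour search is quadratic in the number of points, so it suits small sanity checks; a real evaluation would use a spatial index and far denser sampling.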

Figures

Figures reproduced from arXiv: 2601.21798 by Chi Wang, Donglin Huang, Guangkai Xu, Hao Chen, Junming Huang, Letian Li, Qiang Dai, Weiwei Xu.

**Figure 1.** The Pipeline of CG-MLLM. Our multimodal architecture processes vision, text, and 3D spatial inputs to generate text and 3D spatial outputs. It features a TokenAR Transformer for sequential next-token prediction and a BlockAR Transformer for efficient parallel block prediction, both governed by strict causal masking.

**Figure 2.** Our approach unifies spatial perception and generation in a single model, supporting…

**Figure 3.** Example mask used in CG-MLLM. … size of 151,669. Visual Tokenization. Following the architecture of Qwen3-VL [19], we leverage a SigLIP2 [53] encoder for image feature extraction. To accommodate various input resolutions, we adopt its strategy of employing 2D-RoPE [54] and interpolating absolute position embeddings. Furthermore, a two-layer MLP is utilized to compress 2 × 2 visual features into a single vis…

**Figure 4.** Comparison with other MLLM-based methods on the image-to-3D task. For clearer…

**Figure 5.** More Image-to-3D results produced by our method. For clearer visualization of…

**Figure 6.** Comparison of different strategies. (a) Training MSE loss comparison w/ and w/o…

**Figure 7.** Caption results. Compared to the ground truth from the point cloud perception dataset…

**Figure 8.** Failure Cases. (a) Common: ambiguous hints often lead to inaccurate outputs. (b)…
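
The visual tokenization text captured alongside the Figure 3 caption mentions a two-layer MLP that compresses 2 × 2 visual features. A rough sketch of that kind of merge, in plain Python: each non-overlapping 2 × 2 neighbourhood of patch features is concatenated and passed through a small MLP. The concatenation order, the ReLU activation, and every function name are assumptions for illustration; the excerpt is truncated and gives none of these details.

```python
from typing import List

Vector = List[float]


def merge_2x2(grid: List[List[Vector]]) -> List[List[Vector]]:
    """Concatenate each non-overlapping 2x2 neighbourhood of patch features."""
    height, width = len(grid), len(grid[0])
    if height % 2 or width % 2:
        raise ValueError("grid height and width must be even")
    merged: List[List[Vector]] = []
    for r in range(0, height, 2):
        row: List[Vector] = []
        for c in range(0, width, 2):
            # Assumed order: top-left, top-right, bottom-left, bottom-right.
            row.append(grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1])
        merged.append(row)
    return merged


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def two_layer_mlp(x: Vector, w1: List[Vector], b1: Vector,
                  w2: List[Vector], b2: Vector) -> Vector:
    hidden = [max(0.0, v) for v in linear(x, w1, b1)]  # ReLU is a placeholder activation
    return linear(hidden, w2, b2)


# A 2x4 grid of 1-dimensional patch features becomes a 1x2 grid of 4-dimensional vectors.
patches = [[[1.0], [2.0], [3.0], [4.0]],
           [[5.0], [6.0], [7.0], [8.0]]]
merged = merge_2x2(patches)
```

Merging this way cuts the number of visual positions by a factor of four (eight patches become two merged vectors here), which is the usual motivation for such a step in long-context models.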

Original abstract

Large Language Models(LLMs) have revolutionized text generation and multimodal perception,but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture finegrained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-ofTransformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles blocklevel content. By integrating a pre-trained visionlanguage backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm. Beyond generation, we further observe that learning to produce 3D content transfers back to perception, strengthening the model's image-based 3D understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CG-MLLM offers a TokenAR/BlockAR split in a Mixture-of-Transformer to bring high-resolution 3D generation into MLLMs alongside captioning, though the abstract shows no supporting metrics or details.

The main thing to know about this paper is that it describes CG-MLLM, which uses a Mixture-of-Transformer architecture to handle both 3D captioning and high-resolution 3D generation within a single multimodal large language model. It splits the autoregressive modeling into TokenAR for token-level content and BlockAR for block-level content, while integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space.

What is actually new is the specific use of this TokenAR and BlockAR split to manage the disparate needs of token and spatial block content in 3D data. This allows for long-context interactions in one integrated architecture rather than relying on separate pipelines for generation and understanding.

The paper does well in clearly stating the limitations of existing methods, which often produce either low-resolution meshes or coarse proxies that miss fine-grained geometry. It also highlights a potential benefit where training for 3D content generation can transfer back to improve the model's performance on image-based 3D perception tasks. This bidirectional aspect is a thoughtful addition if supported by the results.

On the soft spots, the biggest issue is the lack of concrete evidence. The abstract claims that experimental results show significant outperformance over existing MLLMs, but it supplies no metrics, no baselines, no ablation studies, and no details on the datasets used. This leaves the central claims resting on unshown evidence, making it hard to assess the true effectiveness. The concern from the stress-test about whether the 3D VAE latent space can preserve fine-grained geometry and support coherence without loss of resolution is important, as many such latent spaces tend to produce smoothed outputs that subsequent models struggle to refine.

The approach appears to build directly on established components like vision-language models and 3D VAEs, so the novelty is more in the combination and the AR split than in inventing entirely new primitives. The math and derivations are not detailed in the abstract, but the architecture is presented as a new construction.

This paper is for researchers in computer vision and multimodal learning who are interested in extending large language models to handle 3D content creation and related perception tasks. A reader looking for fresh architectural ideas in 3D-aware MLLMs could get value from the description of how TokenAR and BlockAR are used to decouple modeling needs. It deserves a serious referee because the proposal is timely and the idea of unifying these capabilities in an MLLM has clear potential, even if the current version needs more detailed experiments and comparisons to fully evaluate its contributions.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CG-MLLM, a multi-modal large language model for simultaneous 3D captioning and high-resolution 3D object generation. It employs a Mixture-of-Transformer architecture that decouples token-level modeling via TokenAR and block-level modeling via BlockAR, integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space to support long-context interactions and fine-grained geometry. The central claims are that this yields significant outperformance over existing MLLMs in high-fidelity 3D generation and that 3D generation training transfers positively to image-based 3D perception.

Significance. If the integration and results hold, the work would advance the extension of LLM paradigms to native high-resolution 3D content creation, offering a unified framework that avoids the low-resolution or proxy compromises of prior methods. The TokenAR/BlockAR decoupling and bidirectional transfer to perception represent potentially valuable architectural and training insights for multimodal models.

major comments (2)

[Method] Method section (architecture description): The claim that the 3D VAE latent space, when processed by the Mixture-of-Transformer (TokenAR for tokens and BlockAR for blocks), captures fine-grained geometry without resolution loss or coherence failure is load-bearing for the central generation claim. However, no details are supplied on VAE architecture, latent dimension, spatial block tokenization, or training objective, leaving open the standard risk that VAE outputs are overly smooth and unrecoverable by the subsequent autoregressive stages.
[Experiments] Experimental results: The assertion of significant outperformance and beneficial transfer to perception is central yet unsupported by any referenced metrics, baselines, ablation studies, or dataset details. This absence prevents evaluation of whether the architecture delivers the claimed high-fidelity results or merely re-expresses prior fitted behaviors.

minor comments (2)

[Abstract] Abstract: Typographical issues include missing spaces and hyphens ('Large Language Models(LLMs)', 'perception,but', 'finegrained', 'Mixture-ofTransformer', 'blocklevel', 'visionlanguage').
[Abstract] Abstract: The summary of results would be strengthened by at least one concrete quantitative improvement or dataset reference to ground the outperformance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to provide the requested details and supporting evidence.

Referee: [Method] Method section (architecture description): The claim that the 3D VAE latent space, when processed by the Mixture-of-Transformer (TokenAR for tokens and BlockAR for blocks), captures fine-grained geometry without resolution loss or coherence failure is load-bearing for the central generation claim. However, no details are supplied on VAE architecture, latent dimension, spatial block tokenization, or training objective, leaving open the standard risk that VAE outputs are overly smooth and unrecoverable by the subsequent autoregressive stages.

Authors: We agree that the original submission omitted critical implementation details for the 3D VAE. In the revised manuscript we have added a dedicated subsection to the Method section that specifies the VAE architecture, latent dimension, spatial block tokenization scheme, and training objective. These additions clarify how the latent representation preserves fine-grained geometry and interfaces with the TokenAR and BlockAR stages.

Revision: yes

Referee: [Experiments] Experimental results: The assertion of significant outperformance and beneficial transfer to perception is central yet unsupported by any referenced metrics, baselines, ablation studies, or dataset details. This absence prevents evaluation of whether the architecture delivers the claimed high-fidelity results or merely re-expresses prior fitted behaviors.

Authors: We acknowledge that the experimental reporting in the initial version was insufficient. The revised Experiments section now includes quantitative metrics, explicit baseline comparisons, ablation studies on the TokenAR/BlockAR and VAE components, and complete dataset descriptions to substantiate the claims of high-fidelity generation and positive transfer to perception tasks.

Revision: yes

Circularity Check

0 steps flagged

No circularity: new architecture proposed without reduction to fitted inputs or self-citations

The paper introduces CG-MLLM as a novel MLLM architecture that integrates a pre-trained vision-language backbone with a specialized 3D VAE latent space inside a Mixture-of-Transformer design (TokenAR for token-level and BlockAR for block-level content). No equations, derivations, or first-principles results are presented that reduce the claimed high-fidelity 3D generation performance to quantities defined by the model's own fitted parameters or prior self-citations. The abstract and described components frame the integration as an empirical construction validated by experiments, with no self-definitional loops, fitted-input predictions, or load-bearing uniqueness theorems from overlapping author work. The central claim therefore stands as an independent architectural proposal rather than a re-expression of its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only review limits visibility into exact parameter counts and training details; the model depends on the effectiveness of pre-trained components and the architectural decoupling.

free parameters (1)

TokenAR and BlockAR layer counts and attention configurations
Architecture hyperparameters required to balance token-level and block-level modeling; values not stated in abstract.

axioms (1)

Domain assumption: A pre-trained vision-language backbone can be fused with a 3D VAE latent space to support long-context 3D interactions
Invoked when the abstract states that the integration facilitates interactions between standard tokens and spatial blocks.

invented entities (1)

CG-MLLM (no independent evidence)
purpose: Unified model for 3D captioning and high-resolution generation
New system proposed in the paper; no independent evidence of its performance is supplied in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

"Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

"3D Tokenization. To enable the perception and generation of 3D content, we integrate a pre-trained Spatial-VAE adapted from Hunyuan3D-2.1 [20]. This component extracts point clouds ... downsampling factor of 20 and a latent dimension of 64."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers
cs.CV 2026-05 unverdicted novelty 5.0

EVA01 introduces a Mixture-of-Transformers model that natively adds 3D mesh understanding, generation, and multi-turn editing to MLLMs by decoupling understanding and generation experts with shared global self-attention.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 1 Pith paper · 38 internal anchors

[1]

2025.doi: 10.48550/arXiv.2507

Kimi Team et al.Kimi K2: Open Agentic Intelligence. 2025.doi: 10.48550/arXiv.2507. 20534

work page doi:10.48550/arxiv.2507 2025
[2]

2024.doi: 10.48550/arXiv.2407

Aaron Grattafiori et al.The Llama 3 Herd of Models. 2024.doi: 10.48550/arXiv.2407. 21783. 12

work page doi:10.48550/arxiv.2407 2024
[3]

2025.doi: 10.48550/arXiv.2412

DeepSeek-AI et al.DeepSeek-V3 Technical Report. 2025.doi: 10.48550/arXiv.2412. 19437

work page doi:10.48550/arxiv.2412 2025
[4]

GPT-4 Technical Report

OpenAI et al.GPT-4 Technical Report. 2024.doi:10.48550/arXiv.2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2024
[5]

Qwen2.5 Technical Report

Qwen et al.Qwen2.5 Technical Report. 2025.doi:10.48550/arXiv.2412.15115

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2025
[6]

Qwen3 Technical Report

An Yang et al.Qwen3 Technical Report. 2025.doi:10.48550/arXiv.2505.09388

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
[7]

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Chunting Zhou et al.Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model. 2024.doi:10.48550/arXiv.2408.11039

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2408.11039 2024
[8]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen et al.Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. 2025.doi:10.48550/arXiv.2501.17811

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.17811 2025
[9]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team.Chameleon: Mixed-Modal Early-Fusion Foundation Models. 2025.doi: 10.48550/arXiv.2405.09818

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.09818 2025
[10]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici et al.Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. 2025.doi: 10. 48550/arXiv.2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Emu3.5: Native Multimodal Models are World Learners

Yufeng Cui et al.Emu3.5: Native Multimodal Models Are World Learners. 2025.doi: 10.48550/arXiv.2510.26583

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.26583 2025
[12]

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou.Show-O2: Improved Native Unified Multimodal Models. 2025.doi:10.48550/arXiv.2506.15564

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.15564 2025
[13]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng et al.Emerging Properties in Unified Multimodal Pretraining. 2025.doi: 10.48550/arXiv.2505.14683

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.14683 2025
[14]

MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh

Shuangkang Fang et al. “MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh”. In: ()

work page
[15]

Llama-mesh: Unifying 3d mesh generation with language models.arXiv preprint arXiv:2411.09595, 2024

Zhengyi Wang et al.LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models. 2024.doi:10.48550/arXiv.2411.09595

work page doi:10.48550/arxiv.2411.09595 2024
[16]

2025.doi:10.48550/arXiv.2508.14879

Bingquan Dai et al.MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds. 2025.doi:10.48550/arXiv.2508.14879

work page doi:10.48550/arxiv.2508.14879 2025
[17]

2025.doi:10.48550/arXiv.2506.01853

Junliang Ye et al.ShapeLLM-omni: A Native Multimodal LLM for 3D Generation and Understanding. 2025.doi:10.48550/arXiv.2506.01853

work page doi:10.48550/arxiv.2506.01853 2025
[18]

2025.doi:10.48550/arXiv.2505.05469

Ava Pun et al.Generating Physically Stable and Buildable Brick Structures from Text. 2025.doi:10.48550/arXiv.2505.05469

work page doi:10.48550/arxiv.2505.05469 2025
[19]

Qwen3-VL Technical Report

Shuai Bai et al.Qwen3-VL Technical Report. 2025.doi:10.48550/arXiv.2511.21631

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.21631 2025
[20]

Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material

Team Hunyuan3D et al.Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material. 2025.doi:10.48550/arXiv.2506.15442

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.15442 2025
[21]

Qwen3-Omni Technical Report

Jin Xu et al.Qwen3-Omni Technical Report. 2025.doi:10.48550/arXiv.2509.17765

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.17765 2025
[22]

Haoyu Lu et al.DeepSeek-VL: Towards Real-World Vision-Language Understanding. 2024. doi:10.48550/arXiv.2403.05525

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.05525 2024
[23]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang et al.Emu3: Next-token Prediction Is All You Need. 2024.doi: 10.48550/ arXiv.2409.18869

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

2025.doi:10.48550/arXiv.2411.16856

Yongwei Chen et al.SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE. 2025.doi:10.48550/arXiv.2411.16856. 13

work page doi:10.48550/arxiv.2411.16856 2025
[25]

2024.doi:10.48550/arXiv.2404.02905

Keyu Tian et al.Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. 2024.doi:10.48550/arXiv.2404.02905

work page doi:10.48550/arxiv.2404.02905 2024
[26]

2023.doi:10.48550/arXiv.2311.17618

Fukun Yin et al.ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model. 2023.doi:10.48550/arXiv.2311.17618

work page doi:10.48550/arxiv.2311.17618 2023
[27]

2023.doi:10.48550/arXiv.2311.15475

Yawar Siddiqui et al.MeshGPT: Generating Triangle Meshes with Decoder-Only Trans- formers. 2023.doi:10.48550/arXiv.2311.15475

work page doi:10.48550/arxiv.2311.15475 2023
[28]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu.Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. 2022.doi: 10.48550/arXiv.2209.03003

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2209.03003 2022
[30]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser et al.Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. 2024.doi:10.48550/arXiv.2403.03206

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.03206 2024
[31]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie.Scalable Diffusion Models with Transformers. 2023.doi: 10.48550/arXiv.2212.09748

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.09748 2023
[32]

Qwen-Image Technical Report

Chenfei Wu et al.Qwen-Image Technical Report. 2025.doi: 10.48550/arXiv.2508.02324

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.02324 2025
[33]

Team Wan et al.Wan: Open and Advanced Large-Scale Video Generative Models. 2025. doi:10.48550/arXiv.2503.20314

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.20314 2025
[34]

2025.doi: 10.48550/arXiv.2511

Bing Wu et al.HunyuanVideo 1.5 Technical Report. 2025.doi: 10.48550/arXiv.2511. 18870

work page doi:10.48550/arxiv.2511 2025
[35]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong et al.CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers. 2022.doi:10.48550/arXiv.2205.15868

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2205.15868 2022
[36]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang et al.CogVideoX: Text-to-Video Diffusion Models with An Expert Trans- former. 2025.doi:10.48550/arXiv.2408.06072

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2408.06072 2025
[37]

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

Guoqing Ma et al.Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model. 2025.doi:10.48550/arXiv.2502.10248

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.10248 2025
[38]

2023.doi:10.48550/arXiv.2212.04493

Yen-Chi Cheng et al.SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation. 2023.doi:10.48550/arXiv.2212.04493

work page doi:10.48550/arxiv.2212.04493 2023
[39]

Michelangelo: Conditional 3D Shape Generation Based on Shape-Image- Text Aligned Latent Representation

Zibo Zhao et al. “Michelangelo: Conditional 3D Shape Generation Based on Shape-Image- Text Aligned Latent Representation”. In: ()

work page
[40]

2024.doi:10.48550/arXiv.2406.13897

Longwen Zhang et al.CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets. 2024.doi:10.48550/arXiv.2406.13897

work page doi:10.48550/arxiv.2406.13897 2024
[41]

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Zibo Zhao et al.Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation. 2025.doi:10.48550/arXiv.2501.12202

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12202 2025
[42]

Structured 3D Latents for Scalable and Versatile 3D Generation

Jianfeng Xiang et al.Structured 3D Latents for Scalable and Versatile 3D Generation. 2025.doi:10.48550/arXiv.2412.01506

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.01506 2025
[43]

2025.doi:10.48550/arXiv.2505.07747

Weiyu Li et al.Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets. 2025.doi:10.48550/arXiv.2505.07747

work page doi:10.48550/arxiv.2505.07747 2025
[44]

2025.doi:10.48550/arXiv.2405.14979

Weiyu Li et al.CraftsMan3D: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner. 2025.doi:10.48550/arXiv.2405.14979

work page doi:10.48550/arxiv.2405.14979 2025
[45]

Sparc: Sparse representation and construc- tion for high-resolution 3d shapes modeling.arXiv preprint arXiv:2505.14521, 2025

Zhihao Li et al.Sparc3D: Sparse Representation and Construction for High-Resolution 3D Shapes Modeling. 2025.doi:10.48550/arXiv.2505.14521. 14

work page doi:10.48550/arxiv.2505.14521 2025
[46]

2025.doi:10.48550/ARXIV.2510.19944

Jiashi Feng et al.Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets. 2025.doi:10.48550/ARXIV.2510.19944

work page doi:10.48550/arxiv.2510.19944 2025
[47]

2024.doi:10.48550/arXiv.2309.11499

Runpei Dong et al.DreamLLM: Synergistic Multimodal Comprehension and Creation. 2024.doi:10.48550/arXiv.2309.11499

work page doi:10.48550/arxiv.2309.11499 2024
[48]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Yuying Ge et al.SEED-X: Multimodal Models with Unified Multi-granularity Comprehen- sion and Generation. 2025.doi:10.48550/arXiv.2404.14396

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.14396 2025
[49]

2025.doi:10.48550/arXiv.2411.07975

Yiyang Ma et al.JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation. 2025.doi:10.48550/arXiv.2411.07975

work page doi:10.48550/arxiv.2411.07975 2025
[50]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie et al.Show-o: One Single Transformer to Unify Multimodal Understanding and Generation. 2025.doi:10.48550/arXiv.2408.12528

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2408.12528 2025
[51]

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Weixin Liang et al. “Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models”. In:Trans. Mach. Learn. Res.2025 (2025).url: https: //openreview.net/forum?id=Nu6N69i8SB

work page 2025
[52]

2019.doi:10.48550/arXiv.1909.03341

Changhan Wang, Kyunghyun Cho, and Jiatao Gu.Neural Machine Translation with Byte-Level Subwords. 2019.doi:10.48550/arXiv.1909.03341

work page doi:10.48550/arxiv.1909.03341 2019
[53] Michael Tschannen et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. 2025. doi:10.48550/arXiv.2502.14786
[54] Jianlin Su et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023. doi:10.48550/arXiv.2104.09864
[55] Joshua Ainslie et al. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. 2023. doi:10.48550/arXiv.2305.13245
[56] Yann N. Dauphin et al. Language Modeling with Gated Convolutional Networks. 2017. doi:10.48550/arXiv.1612.08083
[57] Biao Zhang and Rico Sennrich. Root Mean Square Layer Normalization. 2019. doi:10.48550/arXiv.1910.07467
[58] Mostafa Dehghani et al. Scaling Vision Transformers to 22 Billion Parameters. 2023. doi:10.48550/arXiv.2302.05442
[59] Weijia Shi et al. LMFusion: Adapting Pretrained Language Models for Multimodal Generation. 2025. doi:10.48550/arXiv.2412.15188
[60] Lingchen Meng et al. DeepStack: Deeply Stacking Visual Tokens Is Surprisingly Simple and Effective for LMMs. 2024. doi:10.48550/arXiv.2406.04334
[61] Bo Li et al. LLaVA-OneVision: Easy Visual Task Transfer. doi:10.48550/arXiv.2408.03326. url: http://arxiv.org/abs/2408.03326. Pre-published
[62] Matt Deitke et al. “Objaverse: A Universe of Annotated 3D Objects”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 13142–13153
[63] Chendi Lin et al. “Objaverse++: Curated 3D Object Dataset with Quality Annotations”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. Oct. 2025, pp. 6813–6822
[64] Xuelin Qian et al. Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability. 2024. doi:10.48550/arXiv.2402.12225
[65] Martin Heusel et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. 2018. doi:10.48550/arXiv.1706.08500
[66] Mikołaj Bińkowski et al. Demystifying MMD GANs. 2021. doi:10.48550/arXiv.1801.01401
[67] Alex Nichol et al. Point-E: A System for Generating 3D Point Clouds from Complex Prompts. 2022. doi:10.48550/arXiv.2212.08751
[68] Jianyi Wang, Kelvin C. K. Chan, and Chen Change Loy. Exploring CLIP for Assessing the Look and Feel of Images. 2022. doi:10.48550/arXiv.2207.12396
[69] Junjie Ke et al. MUSIQ: Multi-scale Image Quality Transformer. 2021. doi:10.48550/arXiv.2108.05997
[70] Junsheng Zhou et al. Uni3D: Exploring Unified 3D Representation at Scale. 2023. doi:10.48550/arXiv.2310.06773
[71] Alec Radford et al. Learning Transferable Visual Models From Natural Language Supervision. 2021. doi:10.48550/arXiv.2103.00020
[72] Haotian Liu et al. Visual Instruction Tuning. 2023. doi:10.48550/arXiv.2304.08485
[73] Wenliang Dai et al. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. 2023. doi:10.48550/arXiv.2305.06500
[74] Yining Hong et al. 3D-LLM: Injecting the 3D World into Large Language Models. 2023. doi:10.48550/arXiv.2307.12981
[75] Runsen Xu et al. PointLLM: Empowering Large Language Models to Understand Point Clouds. 2024. doi:10.48550/arXiv.2308.16911

