CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
Pith reviewed 2026-05-21 14:43 UTC · model grok-4.3
The pith
CG-MLLM creates high-resolution 3D objects and captions them inside a single multi-modal LLM by decoupling token and block modeling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CG-MLLM is a multi-modal large language model that performs 3D captioning and high-resolution 3D generation together. It uses a Mixture-of-Transformer architecture in which the Token-level Autoregressive Transformer processes token-level content and the Block-level Autoregressive Transformer processes block-level content. Integration of a pre-trained vision-language backbone with a specialized 3D VAE latent space supports long-context interactions between standard tokens and spatial blocks without loss of resolution or coherence.
What carries the argument
Mixture-of-Transformer architecture with TokenAR and BlockAR components that separate token-level and block-level autoregressive modeling while linking a vision-language backbone to 3D VAE latent space.
If this is right
- High-resolution 3D content creation enters the standard multi-modal LLM workflow.
- A single model handles both describing existing 3D scenes and producing new ones.
- Training for 3D generation strengthens the model's perception of 3D structure from 2D images.
- Existing MLLMs can be extended to output detailed 3D objects rather than coarse proxies.
Where Pith is reading between the lines
- The same split between token and block modeling could be tested on video or point-cloud generation.
- Natural-language 3D design tools might become practical by routing text and image inputs through this architecture.
- Bidirectional gains between generation and perception may appear in other multimodal settings such as audio-visual models.
Load-bearing premise
Combining a pre-trained vision-language backbone with a 3D VAE latent space inside the Mixture-of-Transformer will preserve fine-grained geometry and long-context coherence.
What would settle it
Compare the geometric fidelity of 3D meshes generated by CG-MLLM against ground-truth high-resolution objects on complex prompts; check whether image-based 3D understanding accuracy rises after the generation training stage.
Figures
read the original abstract
Large Language Models(LLMs) have revolutionized text generation and multimodal perception,but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture finegrained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-ofTransformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles blocklevel content. By integrating a pre-trained visionlanguage backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm. Beyond generation, we further observe that learning to produce 3D content transfers back to perception, strengthening the model's image-based 3D understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CG-MLLM, a multi-modal large language model for simultaneous 3D captioning and high-resolution 3D object generation. It employs a Mixture-of-Transformer architecture that decouples token-level modeling via TokenAR and block-level modeling via BlockAR, integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space to support long-context interactions and fine-grained geometry. The central claims are that this yields significant outperformance over existing MLLMs in high-fidelity 3D generation and that 3D generation training transfers positively to image-based 3D perception.
Significance. If the integration and results hold, the work would advance the extension of LLM paradigms to native high-resolution 3D content creation, offering a unified framework that avoids the low-resolution or proxy compromises of prior methods. The TokenAR/BlockAR decoupling and bidirectional transfer to perception represent potentially valuable architectural and training insights for multimodal models.
major comments (2)
- [Method] Method section (architecture description): The claim that the 3D VAE latent space, when processed by the Mixture-of-Transformer (TokenAR for tokens and BlockAR for blocks), captures fine-grained geometry without resolution loss or coherence failure is load-bearing for the central generation claim. However, no details are supplied on VAE architecture, latent dimension, spatial block tokenization, or training objective, leaving open the standard risk that VAE outputs are overly smooth and unrecoverable by the subsequent autoregressive stages.
- [Experiments] Experimental results: The assertion of significant outperformance and beneficial transfer to perception is central yet unsupported by any referenced metrics, baselines, ablation studies, or dataset details. This absence prevents evaluation of whether the architecture delivers the claimed high-fidelity results or merely re-expresses prior fitted behaviors.
minor comments (2)
- [Abstract] Abstract: Typographical issues include missing spaces and hyphens ('Large Language Models(LLMs)', 'finegrained', 'blocklevel', 'block-level content').
- [Abstract] Abstract: The summary of results would be strengthened by at least one concrete quantitative improvement or dataset reference to ground the outperformance claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to provide the requested details and supporting evidence.
read point-by-point responses
-
Referee: [Method] Method section (architecture description): The claim that the 3D VAE latent space, when processed by the Mixture-of-Transformer (TokenAR for tokens and BlockAR for blocks), captures fine-grained geometry without resolution loss or coherence failure is load-bearing for the central generation claim. However, no details are supplied on VAE architecture, latent dimension, spatial block tokenization, or training objective, leaving open the standard risk that VAE outputs are overly smooth and unrecoverable by the subsequent autoregressive stages.
Authors: We agree that the original submission omitted critical implementation details for the 3D VAE. In the revised manuscript we have added a dedicated subsection to the Method section that specifies the VAE architecture, latent dimension, spatial block tokenization scheme, and training objective. These additions clarify how the latent representation preserves fine-grained geometry and interfaces with the TokenAR and BlockAR stages. revision: yes
-
Referee: [Experiments] Experimental results: The assertion of significant outperformance and beneficial transfer to perception is central yet unsupported by any referenced metrics, baselines, ablation studies, or dataset details. This absence prevents evaluation of whether the architecture delivers the claimed high-fidelity results or merely re-expresses prior fitted behaviors.
Authors: We acknowledge that the experimental reporting in the initial version was insufficient. The revised Experiments section now includes quantitative metrics, explicit baseline comparisons, ablation studies on the TokenAR/BlockAR and VAE components, and complete dataset descriptions to substantiate the claims of high-fidelity generation and positive transfer to perception tasks. revision: yes
Circularity Check
No circularity: new architecture proposed without reduction to fitted inputs or self-citations
full rationale
The paper introduces CG-MLLM as a novel MLLM architecture that integrates a pre-trained vision-language backbone with a specialized 3D VAE latent space inside a Mixture-of-Transformer design (TokenAR for token-level and BlockAR for block-level content). No equations, derivations, or first-principles results are presented that reduce the claimed high-fidelity 3D generation performance to quantities defined by the model's own fitted parameters or prior self-citations. The abstract and described components frame the integration as an empirical construction validated by experiments, with no self-definitional loops, fitted-input predictions, or load-bearing uniqueness theorems from overlapping author work. The central claim therefore stands as an independent architectural proposal rather than a re-expression of its inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- TokenAR and BlockAR layer counts and attention configurations
axioms (1)
- domain assumption A pre-trained vision-language backbone can be fused with a 3D VAE latent space to support long-context 3D interactions
invented entities (1)
-
CG-MLLM
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
3D Tokenization. To enable the perception and generation of 3D content, we integrate a pre-trained Spatial-VAE adapted from Hunyuan3D-2.1 [20]. This component extracts point clouds ... downsampling factor of 20 and a latent dimension of 64.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers
EVA01 introduces a Mixture-of-Transformers model that natively adds 3D mesh understanding, generation, and multi-turn editing to MLLMs by decoupling understanding and generation experts with shared global self-attention.
Reference graph
Works this paper leans on
-
[1]
Kimi Team et al.Kimi K2: Open Agentic Intelligence. 2025.doi: 10.48550/arXiv.2507. 20534
-
[2]
Aaron Grattafiori et al.The Llama 3 Herd of Models. 2024.doi: 10.48550/arXiv.2407. 21783. 12
-
[3]
DeepSeek-AI et al.DeepSeek-V3 Technical Report. 2025.doi: 10.48550/arXiv.2412. 19437
-
[4]
OpenAI et al.GPT-4 Technical Report. 2024.doi:10.48550/arXiv.2303.08774
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2024
-
[5]
Qwen et al.Qwen2.5 Technical Report. 2025.doi:10.48550/arXiv.2412.15115
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2025
-
[6]
An Yang et al.Qwen3 Technical Report. 2025.doi:10.48550/arXiv.2505.09388
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
-
[7]
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Chunting Zhou et al.Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model. 2024.doi:10.48550/arXiv.2408.11039
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2408.11039 2024
-
[8]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen et al.Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. 2025.doi:10.48550/arXiv.2501.17811
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.17811 2025
-
[9]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team.Chameleon: Mixed-Modal Early-Fusion Foundation Models. 2025.doi: 10.48550/arXiv.2405.09818
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.09818 2025
-
[10]
Gheorghe Comanici et al.Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. 2025.doi: 10. 48550/arXiv.2507.06261
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Emu3.5: Native Multimodal Models are World Learners
Yufeng Cui et al.Emu3.5: Native Multimodal Models Are World Learners. 2025.doi: 10.48550/arXiv.2510.26583
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.26583 2025
-
[12]
Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou.Show-O2: Improved Native Unified Multimodal Models. 2025.doi:10.48550/arXiv.2506.15564
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.15564 2025
-
[13]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng et al.Emerging Properties in Unified Multimodal Pretraining. 2025.doi: 10.48550/arXiv.2505.14683
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.14683 2025
-
[14]
MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh
Shuangkang Fang et al. “MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh”. In: ()
-
[15]
Llama-mesh: Unifying 3d mesh generation with language models.arXiv preprint arXiv:2411.09595, 2024
Zhengyi Wang et al.LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models. 2024.doi:10.48550/arXiv.2411.09595
-
[16]
2025.doi:10.48550/arXiv.2508.14879
Bingquan Dai et al.MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds. 2025.doi:10.48550/arXiv.2508.14879
-
[17]
2025.doi:10.48550/arXiv.2506.01853
Junliang Ye et al.ShapeLLM-omni: A Native Multimodal LLM for 3D Generation and Understanding. 2025.doi:10.48550/arXiv.2506.01853
-
[18]
2025.doi:10.48550/arXiv.2505.05469
Ava Pun et al.Generating Physically Stable and Buildable Brick Structures from Text. 2025.doi:10.48550/arXiv.2505.05469
-
[19]
Shuai Bai et al.Qwen3-VL Technical Report. 2025.doi:10.48550/arXiv.2511.21631
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.21631 2025
-
[20]
Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material
Team Hunyuan3D et al.Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material. 2025.doi:10.48550/arXiv.2506.15442
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.15442 2025
-
[21]
Jin Xu et al.Qwen3-Omni Technical Report. 2025.doi:10.48550/arXiv.2509.17765
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.17765 2025
-
[22]
Haoyu Lu et al.DeepSeek-VL: Towards Real-World Vision-Language Understanding. 2024. doi:10.48550/arXiv.2403.05525
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.05525 2024
-
[23]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang et al.Emu3: Next-token Prediction Is All You Need. 2024.doi: 10.48550/ arXiv.2409.18869
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
2025.doi:10.48550/arXiv.2411.16856
Yongwei Chen et al.SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE. 2025.doi:10.48550/arXiv.2411.16856. 13
-
[25]
2024.doi:10.48550/arXiv.2404.02905
Keyu Tian et al.Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. 2024.doi:10.48550/arXiv.2404.02905
-
[26]
2023.doi:10.48550/arXiv.2311.17618
Fukun Yin et al.ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model. 2023.doi:10.48550/arXiv.2311.17618
-
[27]
2023.doi:10.48550/arXiv.2311.15475
Yawar Siddiqui et al.MeshGPT: Generating Triangle Meshes with Decoder-Only Trans- formers. 2023.doi:10.48550/arXiv.2311.15475
-
[28]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu.Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. 2022.doi: 10.48550/arXiv.2209.03003
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2209.03003 2022
-
[30]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser et al.Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. 2024.doi:10.48550/arXiv.2403.03206
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.03206 2024
-
[31]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie.Scalable Diffusion Models with Transformers. 2023.doi: 10.48550/arXiv.2212.09748
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.09748 2023
-
[32]
Chenfei Wu et al.Qwen-Image Technical Report. 2025.doi: 10.48550/arXiv.2508.02324
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.02324 2025
-
[33]
Team Wan et al.Wan: Open and Advanced Large-Scale Video Generative Models. 2025. doi:10.48550/arXiv.2503.20314
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.20314 2025
-
[34]
Bing Wu et al.HunyuanVideo 1.5 Technical Report. 2025.doi: 10.48550/arXiv.2511. 18870
-
[35]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong et al.CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers. 2022.doi:10.48550/arXiv.2205.15868
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2205.15868 2022
-
[36]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang et al.CogVideoX: Text-to-Video Diffusion Models with An Expert Trans- former. 2025.doi:10.48550/arXiv.2408.06072
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2408.06072 2025
-
[37]
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Guoqing Ma et al.Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model. 2025.doi:10.48550/arXiv.2502.10248
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.10248 2025
-
[38]
2023.doi:10.48550/arXiv.2212.04493
Yen-Chi Cheng et al.SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation. 2023.doi:10.48550/arXiv.2212.04493
-
[39]
Zibo Zhao et al. “Michelangelo: Conditional 3D Shape Generation Based on Shape-Image- Text Aligned Latent Representation”. In: ()
-
[40]
2024.doi:10.48550/arXiv.2406.13897
Longwen Zhang et al.CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets. 2024.doi:10.48550/arXiv.2406.13897
-
[41]
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
Zibo Zhao et al.Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation. 2025.doi:10.48550/arXiv.2501.12202
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12202 2025
-
[42]
Structured 3D Latents for Scalable and Versatile 3D Generation
Jianfeng Xiang et al.Structured 3D Latents for Scalable and Versatile 3D Generation. 2025.doi:10.48550/arXiv.2412.01506
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.01506 2025
-
[43]
2025.doi:10.48550/arXiv.2505.07747
Weiyu Li et al.Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets. 2025.doi:10.48550/arXiv.2505.07747
-
[44]
2025.doi:10.48550/arXiv.2405.14979
Weiyu Li et al.CraftsMan3D: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner. 2025.doi:10.48550/arXiv.2405.14979
-
[45]
Zhihao Li et al.Sparc3D: Sparse Representation and Construction for High-Resolution 3D Shapes Modeling. 2025.doi:10.48550/arXiv.2505.14521. 14
-
[46]
2025.doi:10.48550/ARXIV.2510.19944
Jiashi Feng et al.Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets. 2025.doi:10.48550/ARXIV.2510.19944
-
[47]
2024.doi:10.48550/arXiv.2309.11499
Runpei Dong et al.DreamLLM: Synergistic Multimodal Comprehension and Creation. 2024.doi:10.48550/arXiv.2309.11499
-
[48]
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
Yuying Ge et al.SEED-X: Multimodal Models with Unified Multi-granularity Comprehen- sion and Generation. 2025.doi:10.48550/arXiv.2404.14396
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.14396 2025
-
[49]
2025.doi:10.48550/arXiv.2411.07975
Yiyang Ma et al.JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation. 2025.doi:10.48550/arXiv.2411.07975
-
[50]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie et al.Show-o: One Single Transformer to Unify Multimodal Understanding and Generation. 2025.doi:10.48550/arXiv.2408.12528
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2408.12528 2025
-
[51]
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Weixin Liang et al. “Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models”. In:Trans. Mach. Learn. Res.2025 (2025).url: https: //openreview.net/forum?id=Nu6N69i8SB
work page 2025
-
[52]
2019.doi:10.48550/arXiv.1909.03341
Changhan Wang, Kyunghyun Cho, and Jiatao Gu.Neural Machine Translation with Byte-Level Subwords. 2019.doi:10.48550/arXiv.1909.03341
-
[53]
Michael Tschannen et al.SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. 2025.doi: 10.48550/arXiv. 2502.14786
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2025
-
[54]
Jianlin Su et al.RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023. doi:10.48550/arXiv.2104.09864
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2104.09864 2023
-
[55]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie et al.GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. 2023.doi:10.48550/arXiv.2305.13245
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.13245 2023
-
[56]
Language Modeling with Gated Convolutional Networks
Yann N. Dauphin et al.Language Modeling with Gated Convolutional Networks. 2017. doi:10.48550/arXiv.1612.08083
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1612.08083 2017
-
[57]
Root Mean Square Layer Normalization
Biao Zhang and Rico Sennrich.Root Mean Square Layer Normalization. 2019.doi: 10.48550/arXiv.1910.07467
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1910.07467 2019
-
[58]
2023.doi: 10.48550/arXiv.2302.05442
Mostafa Dehghani et al.Scaling Vision Transformers to 22 Billion Parameters. 2023.doi: 10.48550/arXiv.2302.05442
-
[59]
2025.doi:10.48550/arXiv.2412.15188
Weijia Shi et al.LMFusion: Adapting Pretrained Language Models for Multimodal Gener- ation. 2025.doi:10.48550/arXiv.2412.15188
-
[60]
2024.doi:10.48550/arXiv.2406.04334
Lingchen Meng et al.DeepStack: Deeply Stacking Visual Tokens Is Surprisingly Simple and Effective for LMMs. 2024.doi:10.48550/arXiv.2406.04334
-
[61]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li et al.LLaVA-OneVision: Easy Visual Task Transfer.doi: 10.48550/arXiv.2408. 03326.url:http://arxiv.org/abs/2408.03326. Pre-published
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2408
-
[62]
Objaverse: A Universe of Annotated 3D Objects
Matt Deitke et al. “Objaverse: A Universe of Annotated 3D Objects”. In:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 13142– 13153
work page 2023
-
[63]
Objaverse++: Curated 3D Object Dataset with Quality Annotations
Chendi Lin et al. “Objaverse++: Curated 3D Object Dataset with Quality Annotations”. In:Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. Oct. 2025, pp. 6813–6822
work page 2025
-
[64]
2024.doi:10.48550/arXiv.2402.12225
Xuelin Qian et al.Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability. 2024.doi:10.48550/arXiv.2402.12225. 15
-
[65]
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
Martin Heusel et al.GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. 2018.doi:10.48550/arXiv.1706.08500
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.08500 2018
-
[66]
Miko laj Bi´ nkowski et al.Demystifying MMD GANs. 2021.doi: 10.48550/arXiv.1801. 01401
-
[67]
Point-E: A System for Generating 3D Point Clouds from Complex Prompts
Alex Nichol et al.Point-E: A System for Generating 3D Point Clouds from Complex Prompts. 2022.doi:10.48550/arXiv.2212.08751
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.08751 2022
-
[68]
Jianyi Wang, Kelvin C. K. Chan, and Chen Change Loy.Exploring CLIP for Assessing the Look and Feel of Images. 2022.doi:10.48550/arXiv.2207.12396
-
[69]
2021.doi: 10.48550/ arXiv.2108.05997
Junjie Ke et al.MUSIQ: Multi-scale Image Quality Transformer. 2021.doi: 10.48550/ arXiv.2108.05997
-
[70]
2023.doi: 10.48550/arXiv.2310.06773
Junsheng Zhou et al.Uni3D: Exploring Unified 3D Representation at Scale. 2023.doi: 10.48550/arXiv.2310.06773
-
[71]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford et al.Learning Transferable Visual Models From Natural Language Supervi- sion. 2021.doi:10.48550/arXiv.2103.00020
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2103.00020 2021
-
[72]
Haotian Liu et al.Visual Instruction Tuning. 2023.doi:10.48550/arXiv.2304.08485
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.08485 2023
-
[73]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai et al.InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. 2023.doi:10.48550/arXiv.2305.06500
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.06500 2023
-
[74]
Yining Hong et al.3D-LLM: Injecting the 3D World into Large Language Models. 2023. doi:10.48550/arXiv.2307.12981
-
[75]
2024.doi:10.48550/arXiv.2308.16911
Runsen Xu et al.PointLLM: Empowering Large Language Models to Understand Point Clouds. 2024.doi:10.48550/arXiv.2308.16911. 16
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.