MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Pith reviewed 2026-05-16 02:28 UTC · model grok-4.3
The pith
A sparse vision-language model activates only 3 billion parameters yet matches the performance of a 7 billion parameter dense model on visual tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoE-LLaVA is a sparse large vision-language model constructed via the MoE-Tuning strategy, which overcomes the performance degradation normally seen when sparsity is applied to multi-modal models. During inference only the top-k experts are activated through learned routers, keeping the remaining experts inactive and thereby holding computational cost constant regardless of total parameter count. With roughly 3B sparsely activated parameters the model reaches performance comparable to LLaVA-1.5-7B across visual understanding datasets and surpasses LLaVA-1.5-13B on object hallucination benchmarks.
What carries the argument
MoE-Tuning training strategy combined with router-based top-k expert selection that keeps only a fixed number of experts active per token.
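The routing step described above can be sketched in a few lines. This is a minimal pure-Python illustration of generic top-k gating, not the authors' implementation. The function names, the linear router, the single-matrix stand-in "experts", and the renormalization of gate weights over the selected experts are all assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def top_k_route(router_weights, token, k):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts.

    Gate weights are renormalized over the selected experts so they sum to 1.
    That renormalization is an assumption of this sketch.
    """
    probs = softmax(matvec(router_weights, token))
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def moe_forward(experts, router_weights, token, k):
    """Combine the outputs of the selected experts; the others never run."""
    out = [0.0] * len(token)
    for idx, gate in top_k_route(router_weights, token, k):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out


def make_linear_expert(rng, dim):
    """Hypothetical stand-in for an FFN expert: a single random linear map."""
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    return lambda token: matvec(weights, token)


rng = random.Random(0)
DIM, NUM_EXPERTS, TOP_K = 4, 4, 2  # illustrative sizes only
experts = [make_linear_expert(rng, DIM) for _ in range(NUM_EXPERTS)]
router_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
token = [0.5, -1.0, 0.25, 2.0]

routes = top_k_route(router_weights, token, TOP_K)
output = moe_forward(experts, router_weights, token, TOP_K)
```

Because `moe_forward` only calls the experts returned by `top_k_route`, the work per token depends on `TOP_K` and not on `NUM_EXPERTS`, which is the property the summary relies on.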
If this is right
- Vision-language models can scale total parameters far beyond active compute without proportional increases in training or inference cost.
- Expert specialization under MoE-Tuning can reduce object hallucination more effectively than simply enlarging a dense model.
- Per-token compute at inference stays constant as the expert pool grows, provided only the top-k experts are activated; the memory needed to hold all experts still grows with total parameter count.
- The same sparsity pattern can be applied to other multi-modal tasks while retaining dense-model accuracy levels.
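The gap between total and active parameters in the points above can be made concrete with a little arithmetic. The sketch below assumes a simplified model in which some weights are always active and each expert has the same size. The class name `SparseModelSize`, its fields, and every number are hypothetical and are not the paper's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparseModelSize:
    """Simplified size model; every field is a hypothetical illustration."""
    shared_params: int      # always-active weights (attention, embeddings, ...)
    params_per_expert: int  # weights of one expert, summed over all MoE layers
    num_experts: int
    top_k: int

    def __post_init__(self):
        if not 1 <= self.top_k <= self.num_experts:
            raise ValueError("top_k must be between 1 and num_experts")

    def total_params(self) -> int:
        return self.shared_params + self.num_experts * self.params_per_expert

    def active_params(self) -> int:
        # Router weights are ignored in this sketch for simplicity.
        return self.shared_params + self.top_k * self.params_per_expert


# Made-up sizes, chosen only to keep the arithmetic easy to follow.
small_pool = SparseModelSize(1_000_000_000, 500_000_000, num_experts=4, top_k=2)
large_pool = SparseModelSize(1_000_000_000, 500_000_000, num_experts=16, top_k=2)
```

Under these made-up numbers, going from 4 to 16 experts triples the total parameter count (3B to 9B) while the active count stays at 2B. All experts still have to be stored, so constant per-token compute does not by itself imply constant deployment cost.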
Where Pith is reading between the lines
- If routing decisions prove sensitive to input modality, the architecture could naturally support mixed vision-text-audio inputs with minimal extra overhead.
- Extending MoE-Tuning to video or 3D understanding would test whether the same stability holds when temporal or spatial structure is added.
- The fixed active-parameter budget suggests that future versions could increase total experts while keeping inference latency unchanged, provided router quality scales.
Load-bearing premise
The MoE-Tuning strategy reliably prevents the performance degradation that sparsity normally causes in multi-modal models.
What would settle it
Training an equivalent dense LLaVA-style model and an MoE-Tuned sparse version on the same data and observing that the sparse version falls measurably behind on the reported visual understanding and hallucination benchmarks would falsify the central claim.
Original abstract
Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs) effectively improves downstream task performances. However, existing scaling methods enable all model parameters to be active for each token in the calculation, which brings massive training and inferring costs. In this work, we propose a simple yet effective training strategy MoE-Tuning for LVLMs. This strategy innovatively addresses the common issue of performance degradation in multi-modal sparsity learning, consequently constructing a sparse model with an outrageous number of parameters but a constant computational cost. Furthermore, we present the MoE-LLaVA, a MoE-based sparse LVLM architecture, which uniquely activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Extensive experiments show the significant performance of MoE-LLaVA in a variety of visual understanding and object hallucination benchmarks. Remarkably, with only approximately 3B sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to the LLaVA-1.5-7B on various visual understanding datasets and even surpasses the LLaVA-1.5-13B in object hallucination benchmark. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MoE-Tuning, a training strategy claimed to prevent performance degradation when applying sparsity to large vision-language models. It introduces the MoE-LLaVA architecture, which activates only the top-k experts via routers at inference time. With approximately 3B sparsely activated parameters, the model is reported to match LLaVA-1.5-7B on visual understanding benchmarks and surpass LLaVA-1.5-13B on object hallucination benchmarks. The authors release code for reproducibility.
Significance. If the central claims hold after additional controls, the work would provide a practical baseline for sparse LVLMs that increase total parameter count without raising inference FLOPs. The public code release supports reproducibility and follow-on research in efficient multi-modal scaling.
major comments (2)
- [3.2] §3.2: The claim that MoE-Tuning addresses the common issue of performance degradation in multi-modal sparsity learning is load-bearing for the headline result, yet the manuscript provides no controlled ablation of an otherwise identical sparse architecture trained without the proposed MoE-Tuning schedule on the same VQA, GQA, and POPE splits.
- [4] §4: Benchmark tables report point estimates for MoE-LLaVA versus LLaVA-1.5-7B/13B without error bars, standard deviations across runs, or explicit confirmation that baseline models were re-implemented with identical data, token counts, and hyperparameters, preventing verification that the ~3B-active-parameter model truly retains dense-model capability.
minor comments (2)
- Abstract: The phrase 'outrageous number of parameters' is informal; replace with a precise statement of total versus active parameter counts.
- [3] Notation: The router and expert-capacity details in the architecture description would benefit from an explicit equation or pseudocode block to clarify top-k selection and load balancing.
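One common way to write the routing and balancing terms requested in the last comment is sketched below. It follows the widely used softmax router with a Switch-Transformer-style auxiliary loss. The symbols ($W$, $E_i$, $\mathcal{T}_k$, $f_i$, $\bar{p}_i$, $\alpha$) are notational assumptions for illustration and are not taken from the manuscript, whose exact formulation may differ.

```latex
% Router probabilities for a token x over N experts, with router weights W
p(x) = \operatorname{softmax}(W x), \qquad p(x) \in \mathbb{R}^{N}

% Indices of the k largest router probabilities
\mathcal{T}_k(x) = \operatorname{TopK}\bigl(p(x),\, k\bigr)

% Sparse output: only the selected experts E_i are evaluated
\operatorname{MoE}(x) = \sum_{i \in \mathcal{T}_k(x)} p_i(x)\, E_i(x)

% Auxiliary load-balancing loss over a batch of T tokens
f_i = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\bigl[\, i \in \mathcal{T}_k(x_t) \,\bigr], \qquad
\bar{p}_i = \frac{1}{T} \sum_{t=1}^{T} p_i(x_t), \qquad
\mathcal{L}_{\text{aux}} = \alpha\, N \sum_{i=1}^{N} f_i\, \bar{p}_i
```

Here $f_i$ is the fraction of tokens routed to expert $i$ and $\bar{p}_i$ is the mean router probability for that expert; the loss is smallest when both are spread evenly across experts.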
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our work. We address the major comments below and have made revisions to the manuscript accordingly.
Point-by-point responses
- Referee: [3.2] §3.2: The claim that MoE-Tuning addresses the common issue of performance degradation in multi-modal sparsity learning is load-bearing for the headline result, yet the manuscript provides no controlled ablation of an otherwise identical sparse architecture trained without the proposed MoE-Tuning schedule on the same VQA, GQA, and POPE splits.
  Authors: We agree that a controlled ablation would provide stronger support for the effectiveness of MoE-Tuning in preventing performance degradation. In the revised manuscript, we will include an ablation study comparing the performance of the sparse architecture trained with and without the MoE-Tuning schedule on the VQA, GQA, and POPE benchmarks.
  revision: yes
- Referee: [4] §4: Benchmark tables report point estimates for MoE-LLaVA versus LLaVA-1.5-7B/13B without error bars, standard deviations across runs, or explicit confirmation that baseline models were re-implemented with identical data, token counts, and hyperparameters, preventing verification that the ~3B-active-parameter model truly retains dense-model capability.
  Authors: The LLaVA-1.5 baseline results are taken from the original publication to ensure consistent evaluation protocols. We have revised Section 4 to explicitly state that our training setup, including data and hyperparameters, follows the LLaVA-1.5 configuration. Due to the substantial computational resources required for multiple training runs, we report single-run results; however, we will add standard deviations from repeated inference evaluations on the test sets in the revised tables.
  revision: partial
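The error bars discussed in this exchange amount to reporting a mean and a sample standard deviation per benchmark. A minimal sketch of that bookkeeping follows, using only the Python standard library. The function names, benchmark keys, and scores are placeholders invented for illustration and are not results from the paper.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for repeated evaluation scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


def format_cell(scores, digits=1):
    """Render a table cell as 'mean +/- std'."""
    mean, std = summarize_runs(scores)
    return f"{mean:.{digits}f} +/- {std:.{digits}f}"


# Placeholder numbers for illustration only; they are NOT results from the paper.
hypothetical_runs = {
    "benchmark_a": [70.0, 71.0, 72.0],
    "benchmark_b": [60.0, 60.0, 60.0],
}
table = {name: format_cell(runs) for name, runs in hypothetical_runs.items()}
```

Repeating inference on a single trained checkpoint captures evaluation-time variance only. It says nothing about variance across training seeds, which is consistent with this response being marked as a partial revision.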
Circularity Check
No circularity: empirical architecture and benchmark results are self-contained
full rationale
The paper introduces an MoE architecture and MoE-Tuning training procedure for LVLMs, then reports measured performance on standard held-out visual understanding and hallucination benchmarks. No mathematical derivation chain exists that reduces a claimed prediction or first-principles result to its own inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on direct experimental outcomes rather than internal re-derivations, satisfying the criteria for a non-circular empirical study.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of experts
- top-k value
axioms (2)
- standard math: Standard transformer attention and feed-forward blocks remain unchanged except for replacement of FFN with MoE layers.
- domain assumption: Top-k expert selection via learned router produces stable training when combined with MoE-Tuning.
Forward citations
Cited by 19 Pith papers
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
  Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
- On the Optimality of Hierarchical Secure Aggregation with Arbitrary Heterogeneous Data Assignment
  A hierarchical secure aggregation scheme with arbitrary heterogeneous data assignment achieves optimal two-layer communication loads under information-theoretic security against collusions and dropouts.
- VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading
  VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
- Learngene Search Across Multiple Datasets for Building Variable-Sized Models
  LSAMD searches a multi-dataset super Ans-Net to extract frequently selected base blocks as learngenes that initialize variable-sized Des-Nets with performance comparable to full pretrain-finetune at lower storage and ...
- State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading
  MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...
- SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
  SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
- Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis
  Transferring a 2D MLLM to 3D CT inputs via parameter reuse, a Text-Guided Hierarchical MoE framework, and two-stage training yields better performance than prior 3D medical MLLMs on medical report generation and visua...
- MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
  MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
- CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
  CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
- ImgEdit: A Unified Image Editing Dataset and Benchmark
  ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
  Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
- An Efficient Token Compression Framework for Visual Object Tracking
  ETCTrack compresses template tokens by 60% in visual trackers via an adaptive compressor and hierarchical interaction, cutting MACs 21.4% with 0.4% accuracy drop on seven benchmarks.
- SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
  SemLT3D introduces semantic-guided expert distillation with a language MoE module and CLIP projection to enrich features for long-tailed classes in camera-only 3D detection.
- CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering
  CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.
- Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
  Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
- UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
  UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
- PaliGemma: A versatile 3B VLM for transfer
  PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
- A Survey on Multimodal Large Language Models
  This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.