MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Bingqi Ma; Dazhong Shen; Dongzhi Jiang; Guanglu Song; Hao Shao; Hongsheng Li; Yu Liu; Zhuofan Zong

arxiv: 2404.13046 · v2 · pith:UOOQVITQnew · submitted 2024-04-19 · 💻 cs.CV

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Zhuofan Zong , Bingqi Ma , Dazhong Shen , Guanglu Song , Hao Shao , Dongzhi Jiang , Hongsheng Li , Yu Liu This is my paper

classification 💻 cs.CV

keywords visionexpertsencoderimagemultimodalunderstandingabilityclip

0 comments

read the original abstract

As the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects MLLM's understanding on diverse image content. Although some large-scale pretrained vision encoders such as vision encoders in CLIP and DINOv2 have brought promising performance, we found that there is still no single vision encoder that can dominate various image content understanding, e.g., the CLIP vision encoder leads to outstanding results on general image understanding but poor performance on document or chart content. To alleviate the bias of CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose the MoVA, a powerful and novel MLLM, adaptively routing and fusing task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts according to the user instruction, input image, and expertise of vision experts. This benefits from the powerful model function understanding ability of the large language model (LLM). In the fine-grained stage, we elaborately conduct the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from various experts. This coarse-to-fine paradigm effectively leverages representations from experts based on multimodal context and model expertise, further enhancing the generalization ability. We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without any bells and whistles, MoVA can achieve significant performance gains over current state-of-the-art methods in a wide range of challenging multimodal benchmarks.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs
cs.CV 2026-06 unverdicted novelty 6.0

Retraining all 31 subsets of five vision encoders shows Capacity and Necessity are distinct, pre-projector effective rank predicts residual performance at fixed parameter count, and high-Capacity plus adaptive complem...
MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning
cs.AI 2026-06 unverdicted novelty 5.0

MathVis-Fine proposes a dataset with fine-grained visual annotations and dependency ratings plus a progressive two-stage training paradigm to align visual supervision with sample-specific necessity in multimodal mathe...
NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation
cs.CV 2025-10 unverdicted novelty 5.0

NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal ...