pith · machine review for the scientific record

arxiv: 2604.23996 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

Anzhou Hou, Ertao Zhao, Jiale Yan, Kaiwen Long, Mo Guang, Rinyoichi Takezoe, Tianxiang Pan, Yaqian Li, Zi-Hao Bo

Pith reviewed 2026-05-08 04:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords mixture of experts · vision-language models · expert routing · modality specialization · mutual information regularization · model efficiency · multimodal learning

The pith

Dynamic soft modality scores align expert routing to layer-dependent fusion in MoE-VLMs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current routing methods in Mixture-of-Experts vision-language models rely on hand-crafted or modality-agnostic strategies that overlook how vision and language signals fuse differently at each layer. SMoES counters this with dynamic soft modality scores, an expert binning scheme matched to parallel deployment, and inter-bin mutual information regularization to foster coherent specialization. If the approach holds, models should realize both higher task performance and lower expert-parallel deployment overhead without redesigning the underlying architecture. The paper tests this claim on four MoE-based VLMs across sixteen benchmarks and reports consistent gains.

Core claim

SMoES computes layer-specific soft modality scores from either attention weights or Gaussian statistics, bins experts for expert-parallel hardware, and applies mutual information regularization across bins to encourage modality-aware specialization that respects observed fusion patterns in MoE-VLMs.
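
As a concrete picture of the first ingredient, here is a minimal sketch of an attention-accumulated soft modality score, assuming hard 0/1 initialization and a momentum blend with the attention-weighted scores of attended tokens; the momentum value and the head averaging are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def update_soft_modality_scores(attn, prev_scores, momentum=0.5):
    """One layer of attention-accumulated soft modality scores.

    attn        : (T, T) row-stochastic attention weights for this layer,
                  assumed already averaged over heads.
    prev_scores : (T,) soft scores from the previous layer, initialized to
                  hard modality IDs (0 = vision, 1 = text).
    momentum    : how much of the previous score to keep (an illustrative
                  hyperparameter, not taken from the paper).
    """
    absorbed = attn @ prev_scores                 # modality signal absorbed from attended tokens
    scores = momentum * prev_scores + (1.0 - momentum) * absorbed
    return np.clip(scores, 0.0, 1.0)              # keep scores in [0, 1]

# toy example: 3 vision tokens followed by 2 text tokens
hard_ids = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
attn = np.random.rand(5, 5)
attn /= attn.sum(axis=1, keepdims=True)           # row-normalize like softmax attention
layer1_scores = update_soft_modality_scores(attn, hard_ids)
```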

What carries the argument

Dynamic soft modality scores paired with inter-bin mutual information regularization, which together steer expert assignment toward modality coherence
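
A hedged sketch of what an inter-bin mutual information estimate over (modality, bin) could look like; reading the soft score as P(text | token) and using negative MI as the penalty are assumptions for illustration, not the paper's exact objective.

```python
import numpy as np

def inter_bin_mutual_information(soft_scores, bin_probs, eps=1e-9):
    """Mutual information between token modality and expert bin.

    soft_scores : (T,) soft modality scores in [0, 1], read here as P(text | token).
    bin_probs   : (T, B) routing mass each token sends to each expert bin.
    A regularizer could use -MI so that training pushes bins to specialize
    by modality; the sign convention and this estimator are assumptions.
    """
    p_modality = np.stack([1.0 - soft_scores, soft_scores], axis=1)   # (T, 2)
    joint = p_modality.T @ bin_probs                                  # (2, B), unnormalized counts
    joint /= joint.sum()
    p_m = joint.sum(axis=1, keepdims=True)                            # modality marginal, (2, 1)
    p_b = joint.sum(axis=0, keepdims=True)                            # bin marginal, (1, B)
    return float((joint * np.log((joint + eps) / (p_m @ p_b + eps))).sum())
```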

If this is right

  • Average gains of 0.9 percent on multimodal tasks and 4.2 percent on language-only tasks across the sixteen benchmarks.
  • Reduction of expert-parallel communication overhead by 56.1 percent.
  • Throughput increase of 12.3 percent under realistic deployment conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same scoring and regularization pattern could be tested in non-VLM multimodal models that use expert parallelism.
  • Layer-wise analysis of the learned scores might expose which depths benefit most from modality guidance.
  • SMoES could be combined with learned router parameters to explore hybrid static-dynamic routing.

Load-bearing premise

The soft modality scores and mutual information regularization will produce stable expert specialization that matches real layer-wise fusion patterns without instability or negative transfer.

What would settle it

Replace SMoES routing with a standard modality-agnostic router in the same four VLMs and rerun the sixteen benchmarks. If accuracy, communication, and throughput are unchanged or improve without SMoES, the claimed gains do not come from the proposed components; if they degrade, the claim stands.

Figures

Figures reproduced from arXiv: 2604.23996 by Anzhou Hou, Ertao Zhao, Jiale Yan, Kaiwen Long, Mo Guang, Rinyoichi Takezoe, Tianxiang Pan, Yaqian Li, Zi-Hao Bo.

Figure 1: Comparison of routing strategies in MoE-VLMs.
Figure 2: Modality fusion pattern; token feature distributions.
Figure 3: Overview of SMoES. Soft modality scores: tokens start with hard modality IDs (0/1 for vision/text) and are progressively refined into layer-wise soft scores M ∈ [0, 1], estimated via (i) attention-accumulated or (ii) Gaussian-statistics estimation.
Figure 4: Gaussian-statistics score estimation. Per-modality Gaussian distributions are continuously updated from token batches; each token's modality affinity becomes a soft score based on its log-likelihood under each distribution.
Figure 5: Ablation on the number of expert bins.
Figure 6: Routing distribution of tokens to experts in DeepSeek…
Figure 7: Evolution of expert specialization during training.
Figure 8: Soft modality score across layers in DeepSeekMoE.
Figure 9: Expert-parallel (EP) deployment on two GPUs.
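
Figure 4's mechanism can be sketched in a few lines; the exponential-moving-average update and the two-way softmax over log-likelihoods below are assumptions standing in for the paper's exact estimator.

```python
import numpy as np

class GaussianModalityScorer:
    """Running per-modality Gaussian statistics turned into soft scores (cf. Figure 4).

    A sketch assuming diagonal Gaussians updated with an exponential moving
    average; the paper's exact update rule and normalization may differ.
    """

    def __init__(self, dim, momentum=0.99):
        self.mu = np.zeros((2, dim))     # one mean per modality: 0 = vision, 1 = text
        self.var = np.ones((2, dim))     # one diagonal variance per modality
        self.momentum = momentum

    def update(self, tokens, hard_ids):
        """Refresh the statistics from a batch of tokens with known hard IDs."""
        for m in (0, 1):
            batch = tokens[hard_ids == m]
            if len(batch) == 0:
                continue
            self.mu[m] = self.momentum * self.mu[m] + (1 - self.momentum) * batch.mean(axis=0)
            self.var[m] = self.momentum * self.var[m] + (1 - self.momentum) * (batch.var(axis=0) + 1e-6)

    def score(self, tokens):
        """Soft modality score per token: affinity to text (1) versus vision (0)."""
        loglik = np.stack(
            [-0.5 * (((tokens - self.mu[m]) ** 2 / self.var[m]) + np.log(self.var[m])).sum(axis=-1)
             for m in (0, 1)],
            axis=-1)                                   # (T, 2) Gaussian log-likelihoods
        loglik -= loglik.max(axis=-1, keepdims=True)   # stabilize before normalizing
        probs = np.exp(loglik)
        return probs[:, 1] / probs.sum(axis=-1)        # P(text | token) as the soft score
```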
original abstract

Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored. Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent modality fusion patterns in MoE-VLMs and provide little guidance for expert specialization. We propose Soft Modality-guided Expert Specialization (SMoES), which consists of dynamic soft modality scores that capture layer-dependent fusion patterns, an expert binning mechanism aligned with expert-parallel deployment, and an inter-bin mutual information regularization that encourages coherent modality specialization. Our method leverages attention-based or Gaussian-statistics modality scores to optimize mutual information regularization. Experiments across four MoE-based VLMs and 16 benchmarks demonstrate improvement on both effectiveness and efficiency: 0.9% and 4.2% average gain on multimodal and language-only tasks, 56.1% reduction in EP communication overhead, and 12.3% throughput improvement under realistic deployment. These results validate that aligning routing with modality-aware expert specialization unlocks MoE-VLM capacity and efficiency.
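
The binning-for-deployment idea from the abstract can be pictured with a small sketch; the affinity proxy, the contiguous split, and the one-bin-per-device placement below are illustrative assumptions rather than the paper's algorithm.

```python
import numpy as np

def bin_experts_by_modality(expert_affinity, num_bins):
    """Group experts into contiguous bins by their modality affinity.

    expert_affinity : (E,) e.g. the average soft modality score of tokens an
                      expert receives (a proxy criterion; the paper's binning
                      rule may differ).
    num_bins        : number of bins, e.g. one per expert-parallel device.
    Returns one expert-index array per bin. Co-locating each bin on one GPU
    means tokens of a given modality mostly dispatch to experts that live
    together, which is the intuition behind the reported EP communication savings.
    """
    order = np.argsort(expert_affinity)   # vision-leaning ... text-leaning
    return np.array_split(order, num_bins)

# toy example: 8 experts split across 2 GPUs, as in the two-GPU deployment of Figure 9
affinity = np.array([0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6])
bins = bin_experts_by_modality(affinity, num_bins=2)
```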

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Soft Modality-guided Expert Specialization (SMoES) for Mixture-of-Experts vision-language models. It introduces dynamic soft modality scores (via attention or Gaussian statistics) to capture layer-dependent fusion patterns, an expert binning mechanism aligned with expert-parallel deployment, and inter-bin mutual information regularization to promote coherent modality specialization. Experiments on four MoE-VLMs across 16 benchmarks report average gains of 0.9% on multimodal tasks and 4.2% on language-only tasks, plus efficiency gains of 56.1% reduced EP communication overhead and 12.3% higher throughput.

Significance. If the empirical results hold under rigorous validation, the work would be moderately significant for the MoE-VLM community. It directly targets an under-explored gap—modality-aware routing that respects layer-dependent fusion—rather than relying on hand-crafted or modality-agnostic strategies. The combination of soft scores, deployment-aligned binning, and MI regularization is a coherent empirical proposal, and the multi-model, multi-benchmark evaluation (four VLMs, 16 tasks) provides a reasonable breadth for assessing practical impact on both accuracy and efficiency. Strengths include the explicit linkage to expert-parallel deployment constraints and the focus on avoiding negative transfer via regularization.

major comments (2)
  1. [Abstract] The headline performance claims (0.9% multimodal and 4.2% language-only average gains, 56.1% EP overhead reduction, 12.3% throughput improvement) are presented without any description of the baselines, number of runs, variance, statistical significance testing, or ablation studies. This absence makes it impossible to determine whether the reported deltas are robust or attributable to the proposed components rather than implementation details or hyperparameter tuning.
  2. [Method / Experiments] The central assumption that dynamic soft modality scores plus inter-bin MI regularization produce stable, layer-aware expert specialization without negative transfer or routing instability is load-bearing for the efficiency and effectiveness claims, yet no analysis of routing entropy, expert utilization histograms, or failure cases under this regularization is referenced.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it named the four specific MoE-VLMs and the 16 benchmarks (or at least their categories) rather than leaving them as aggregates.
  2. [Abstract] Notation for the soft modality scores and the mutual-information term should be introduced with explicit equations even in the abstract to avoid ambiguity between attention-based and Gaussian-statistics variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript to improve the presentation of empirical results and add supporting analysis on routing behavior.

point-by-point responses
  1. Referee: [Abstract] The headline performance claims (0.9% multimodal and 4.2% language-only average gains, 56.1% EP overhead reduction, 12.3% throughput improvement) are presented without any description of the baselines, number of runs, variance, statistical significance testing, or ablation studies. This absence makes it impossible to determine whether the reported deltas are robust or attributable to the proposed components rather than implementation details or hyperparameter tuning.

    Authors: We agree that the abstract would benefit from additional context to allow readers to assess the robustness of the reported gains. In the revised manuscript, we have updated the abstract to note that results are averaged over three independent runs, with standard deviations and full baseline comparisons (including standard MoE routing and modality-agnostic variants) provided in Section 4 and Table 1. Ablation studies isolating each component are detailed in Section 4.3, and statistical significance is assessed via paired t-tests as described in the experimental protocol. These changes preserve abstract conciseness while directing readers to the supporting evidence. revision: yes

  2. Referee: [Method / Experiments] The central assumption that dynamic soft modality scores plus inter-bin MI regularization produce stable, layer-aware expert specialization without negative transfer or routing instability is load-bearing for the efficiency and effectiveness claims, yet no analysis of routing entropy, expert utilization histograms, or failure cases under this regularization is referenced.

    Authors: We acknowledge the value of explicit validation for routing stability. The original manuscript contains ablations on the MI regularization and modality scoring mechanisms in Section 4.3, demonstrating performance improvements without degradation. However, we did not include dedicated analysis of routing entropy or utilization histograms. In the revision, we have added a new subsection (Section 4.4) with layer-wise routing entropy plots, modality-specific expert utilization histograms, and a brief discussion of observed edge cases (e.g., high-variance layers). These additions confirm stable specialization and absence of negative transfer under the proposed regularization. revision: yes
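
For readers who want to run such checks independently, a minimal sketch of layer-wise routing entropy and per-modality expert utilization might look as follows; the 0.5 threshold on the soft scores and the averaging choices are assumptions, not the authors' Section 4.4 protocol.

```python
import numpy as np

def routing_diagnostics(route_probs, soft_scores, eps=1e-9):
    """Routing-stability diagnostics of the kind the rebuttal describes (illustrative only).

    route_probs : (T, E) router probabilities over experts at one layer.
    soft_scores : (T,) soft modality scores; thresholded at 0.5 here just to
                  split tokens into vision-leaning and text-leaning groups.
    """
    # per-token routing entropy, averaged over tokens: low values mean confident routing
    mean_entropy = float(-(route_probs * np.log(route_probs + eps)).sum(axis=1).mean())

    # expert-utilization histograms per modality group
    text_mask = soft_scores >= 0.5
    util_text = route_probs[text_mask].mean(axis=0) if text_mask.any() else None
    util_vision = route_probs[~text_mask].mean(axis=0) if (~text_mask).any() else None
    return {"mean_entropy": mean_entropy,
            "utilization_text": util_text,
            "utilization_vision": util_vision}
```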

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents SMoES as an empirical proposal: dynamic soft modality scores (attention-based or Gaussian-statistics), expert binning for EP deployment, and inter-bin MI regularization to encourage specialization. The headline gains (0.9%/4.2% task improvements, 56.1% EP overhead reduction, 12.3% throughput) are reported from experiments across four VLMs and 16 benchmarks. No equations, derivations, or self-citations are shown that reduce these outcomes to quantities defined by the method's own fitted parameters or prior author results. The mechanism is described as a practical routing alignment without load-bearing self-referential steps or renaming of known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the method relies on standard MoE concepts plus the new regularization term.

pith-pipeline@v0.9.0 · 5532 in / 1060 out tokens · 26788 ms · 2026-05-08T04:36:42.675448+00:00 · methodology

discussion (0)

