arxiv: 2606.00275 · v1 · pith:Z6BZGL5Rnew · submitted 2026-05-29 · 💻 cs.CV · cs.AI

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

Pith reviewed 2026-06-28 22:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords mixture of experts · vision-language models · hyperbolic geometry · evidence prioritization · hallucination mitigation · asymmetric architecture · multimodal efficiency

The pith

AsyMoE models vision-language asymmetry with hyperbolic geometry and evidence-priority experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current Mixture of Experts approaches in large vision-language models treat vision and language symmetrically, which ignores their asymmetric relationship. Text queries describe only partial aspects of complete visual scenes, creating hierarchical containment that Euclidean spaces cannot properly encode. Language experts in deeper layers shift toward parametric memory and away from the input context, leading to hallucinations. AsyMoE counters this with specialized expert groups that handle modality-specific tasks, use hyperbolic geometry for hierarchy, and prioritize evidence to stay grounded. If correct, this yields measurable gains in accuracy and efficiency on multimodal benchmarks.

Core claim

AsyMoE is a novel architecture that explicitly models the asymmetry in vision-language processing through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. This design leads to consistent improvements over baseline methods.
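To make the "negative curvature geometry" concrete, here is a minimal sketch of geodesic distance in a Poincaré ball. The abstract does not say which hyperbolic model AsyMoE uses, so the Poincaré ball, the function name `poincare_distance`, the curvature parameter `c`, and the 2-D points are all illustrative assumptions rather than the authors' implementation. It shows only the general property that makes hyperbolic space attractive for containment hierarchies: the same Euclidean gap costs much more distance near the boundary than near the origin.

```python
import math


def poincare_distance(u, v, c=1.0):
    """Geodesic distance between u and v in a Poincare ball of curvature -c (c > 0)."""
    sq_norm = lambda x: sum(t * t for t in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - c * sq_norm(u)) * (1.0 - c * sq_norm(v))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


# Hypothetical 2-D embeddings (not from the paper): a whole "scene" at the
# origin and two partial "query" descriptions close to the unit-ball boundary.
scene = [0.0, 0.0]
query_a = [0.9, 0.0]
query_b = [0.9, 0.1]

# Both pairs are 0.1 apart in Euclidean terms, but the pair near the
# boundary is much farther apart in hyperbolic terms.
near_origin = poincare_distance(scene, [0.0, 0.1])
near_boundary = poincare_distance(query_a, query_b)
print(near_origin, near_boundary)
```

Because volume grows exponentially with radius in such a space, general concepts can sit near the origin while many specific ones spread out near the boundary. That is the usual argument for embedding tree-like structure in hyperbolic rather than Euclidean space.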

What carries the argument

AsyMoE architecture with three specialized expert groups: intra-modality experts, hyperbolic inter-modality experts using negative curvature geometry, and evidence-priority language experts.

If this is right

  • Achieves average gains of 1.5% over MoE variants on multimodal tasks.
  • Improves up to 3.8% on hallucination-sensitive tasks.
  • Activates 25.45% fewer parameters than dense models.
  • Maintains contextual grounding in language experts across all network depths.
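As a quick sanity check on what the 25.45% figure would mean in absolute terms, the arithmetic can be sketched as below. The dense parameter count is a hypothetical placeholder, since the abstract does not name the dense baseline or its size, and the helper `activated_params` is introduced here only for illustration.

```python
def activated_params(dense_params, reduction_pct=25.45):
    """Parameters active per forward pass, given a percentage reduction vs. a dense model."""
    return dense_params * (1.0 - reduction_pct / 100.0)


dense = 7e9  # hypothetical dense parameter count, not a figure from the paper
active = activated_params(dense)
print(f"{active / 1e9:.4f}B active of {dense / 1e9:.1f}B")
```

Under that assumed 7B baseline, roughly 5.2B parameters would be active per forward pass. The real number depends on the baseline the paper actually compares against.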

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hyperbolic component could be tested on other containment-heavy tasks such as visual question answering with nested objects.
  • Evidence prioritization might reduce over-reliance on training data in text-only models without multimodal input.
  • The three-group split suggests a general template for handling asymmetric modalities in future multimodal architectures.

Load-bearing premise

Two premises carry the weight: that text and vision form hierarchical containment relationships that Euclidean expert spaces cannot encode, and that language experts in deeper layers necessarily lose grounding in the provided context.

What would settle it

A controlled test where replacing the hyperbolic inter-modality experts with Euclidean ones removes the reported gains on tasks involving scene containment, or disabling evidence-priority experts increases hallucination rates in deeper layers.

Figures

Figures reproduced from arXiv: 2606.00275 by Dandan Zhu, Hangxiangpan Wang, Heng Zhang, Huishen Jiao, Yi Zhao, Zijie Zhou.

Figure 1: Motivation for AsyMoE. (a) Cross-modal Association Limitations. Euclidean FFN space with flat geometry limits hierarchical semantic modeling. Text queries describe partial aspects of visual scenes, forming natural containment relationships. (b) Memory Priority Shift. Attention analysis on Qwen2.5-VL-7B shows language experts shift from evidence-based reasoning to parametric memory dependence in deeper lay…
Figure 2: Overall framework of AsyMoE: (a) Modality-specific experts in Euclidean space with distorted cross-modal collabora…
Figure 3: Training loss and accuracy score of AsyMoE in …
Figure 4: Performance scaling across visual instruction tuning data scales. AsyMoE demonstrates superior scaling efficiency
Figure 5: Expert Activation Distribution Across Tasks. We visualize the dynamic distribution of activated experts across various …
Figure 6: Layer-wise Attention Gain
Original abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. This asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to parametric memory dependence, losing grounding in the provided visual and linguistic information. To address these issues, we propose AsyMoE, a novel architecture that explicitly models this asymmetry through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. Extensive experiments demonstrate that AsyMoE achieves consistent improvements over baseline methods, with average gains of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks. AsyMoE activates 25.45% fewer parameters compared to dense models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents AsyMoE, a novel MoE architecture for large vision-language models that uses three expert groups: intra-modality experts, hyperbolic inter-modality experts to capture hierarchical cross-modal relationships, and evidence-priority language experts to maintain contextual grounding. It claims consistent improvements of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks, while activating 25.45% fewer parameters than dense models.

Significance. The proposed architecture addresses a potentially important asymmetry in how visual and linguistic modalities are processed in LVLMs. The use of hyperbolic geometry for containment structures and mechanisms to reduce parametric memory dependence could lead to more efficient and less hallucinatory multimodal models if the results are substantiated. The reported parameter savings are a notable strength.

major comments (1)
  1. [Abstract] Abstract: The abstract states performance numbers (1.5% average gain, 3.8% on hallucination tasks) but supplies no information on experimental setup, baselines, statistical tests, or ablation controls, making it impossible to assess whether the data support the claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of AsyMoE's potential to address modality asymmetry. We address the single major comment point by point below.

  1. Referee: [Abstract] Abstract: The abstract states performance numbers (1.5% average gain, 3.8% on hallucination tasks) but supplies no information on experimental setup, baselines, statistical tests, or ablation controls, making it impossible to assess whether the data support the claims.

    Authors: We acknowledge that the abstract, due to length constraints, focuses on high-level claims without experimental details. The full manuscript provides these in Section 4 (experimental setup with baselines including standard MoE variants and dense models, datasets, and evaluation protocols) and Section 5 (ablation studies and statistical significance via multiple runs with variance). To address the concern, we will revise the abstract to include a concise reference to the evaluation framework and key controls.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The abstract presents a motivation for asymmetric MoE handling of vision-language modalities and reports empirical gains, but contains no equations, fitted parameters, derivations, or self-citations. No derivation chain exists that could reduce to inputs by construction. The full manuscript is referenced externally but the provided text shows a self-contained empirical proposal without load-bearing mathematical reductions or imported uniqueness claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no equations, training details, or parameter counts beyond the headline 25.45% figure are available, so the ledger is necessarily incomplete.

pith-pipeline@v0.9.1-grok · 5780 in / 1068 out tokens · 22316 ms · 2026-06-28T22:41:40.742443+00:00 · methodology
