Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

Dandan Zhu; Hangxiangpan Wang; Heng Zhang; Huishen Jiao; Yi Zhao; Zijie Zhou

arxiv: 2606.00275 · v1 · pith:Z6BZGL5Rnew · submitted 2026-05-29 · 💻 cs.CV · cs.AI

Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang, Huishen Jiao, Yi Zhao

Pith reviewed 2026-06-28 22:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords mixture of experts · vision-language models · hyperbolic geometry · evidence prioritization · hallucination mitigation · asymmetric architecture · multimodal efficiency

The pith

AsyMoE models vision-language asymmetry with hyperbolic geometry and evidence-priority experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current Mixture of Experts (MoE) approaches in large vision-language models treat vision and language symmetrically and so ignore their asymmetric relationship. Text queries describe only partial aspects of complete visual scenes, which creates a hierarchical containment structure that Euclidean spaces struggle to encode. Language experts in deeper layers shift toward parametric memory and away from the input context, leading to hallucinations. AsyMoE counters this with specialized expert groups that handle modality-specific processing, use hyperbolic geometry for hierarchy, and prioritize evidence to stay grounded. If correct, this yields the reported gains: 1.5% on average over MoE variants, up to 3.8% on hallucination-sensitive tasks, and 25.45% fewer activated parameters than dense models.

Core claim

AsyMoE is a novel architecture that explicitly models the asymmetry in vision-language processing through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. This design leads to consistent improvements over baseline methods.
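The abstract does not describe how tokens are routed to these expert groups, so the following is only a generic sketch of sparse top-k gating as commonly used in MoE layers. The expert names, the router logits, and the choice of k = 2 are all assumptions made for illustration; none of them come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are softmax probabilities renormalised over the selected experts,
    so they sum to 1. This is generic top-k gating, not the paper's router.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# Hypothetical layout: one expert per group named in the core claim.
EXPERTS = ["intra_modality", "hyperbolic_inter_modality", "evidence_priority_language"]

router_logits = [1.5, 2.0, -0.3]  # made-up scores for a single token
for index, weight in top_k_route(router_logits, k=2):
    print(EXPERTS[index], round(weight, 3))
```

Only the selected experts run for a given token, which is the usual source of the activated-parameter savings that sparse MoE designs report.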

What carries the argument

AsyMoE architecture with three specialized expert groups: intra-modality experts, hyperbolic inter-modality experts using negative curvature geometry, and evidence-priority language experts.
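To make "negative curvature geometry" concrete, here is a minimal sketch of the geodesic distance in the Poincaré ball, one standard model of hyperbolic space. The abstract does not say which hyperbolic model or curvature AsyMoE uses, so the Poincaré ball, the curvature parameter `c`, and the sample points are assumptions.

```python
import math


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between x and y in the Poincare ball of curvature -c (c > 0).

    d(x, y) = (1 / sqrt(c)) * arcosh(1 + 2c * |x - y|^2 / ((1 - c|x|^2)(1 - c|y|^2)))
    """
    def squared_norm(v):
        return sum(t * t for t in v)

    room_x = 1.0 - c * squared_norm(x)
    room_y = 1.0 - c * squared_norm(y)
    if room_x <= 0.0 or room_y <= 0.0:
        raise ValueError("points must lie strictly inside the ball")
    diff = [a - b for a, b in zip(x, y)]
    arg = 1.0 + 2.0 * c * squared_norm(diff) / (room_x * room_y)
    return math.acosh(arg) / math.sqrt(c)


# The same Euclidean gap of 0.1, measured near the origin and near the boundary.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))
print(near_origin, near_boundary)
```

Distances grow rapidly toward the boundary, so the space has room for many fine-grained items far from the origin beneath a few general ones near it. That tree-like capacity is the usual argument for embedding containment hierarchies in hyperbolic rather than Euclidean space.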

If this is right

Achieves average gains of 1.5% over MoE variants on multimodal tasks.
Improves up to 3.8% on hallucination-sensitive tasks.
Activates 25.45% fewer parameters compared to dense models.
Maintains contextual grounding in language experts across all network depths.
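As a quick arithmetic illustration of the parameter claim, the sketch below converts a percentage reduction into an activated-parameter count. The 7B dense size is a hypothetical placeholder, and the abstract does not state which dense model the 25.45% figure is measured against.

```python
def activated_params(dense_params, reduction_pct):
    """Parameters active per token, given a percentage reduction relative to a dense model."""
    return dense_params * (1.0 - reduction_pct / 100.0)


DENSE_PARAMS_B = 7.0   # hypothetical dense model size in billions, not from the paper
REDUCTION_PCT = 25.45  # reduction reported in the abstract

print(f"{activated_params(DENSE_PARAMS_B, REDUCTION_PCT):.2f}B activated")
```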

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The hyperbolic component could be tested on other containment-heavy tasks such as visual question answering with nested objects.
Evidence prioritization might reduce over-reliance on training data in text-only models without multimodal input.
The three-group split suggests a general template for handling asymmetric modalities in future multimodal architectures.

Load-bearing premise

Two assumptions carry the weight: that text and vision form hierarchical containment relationships that Euclidean expert spaces cannot encode, and that language experts in deeper layers necessarily lose grounding in the provided context.

What would settle it

A controlled test where replacing the hyperbolic inter-modality experts with Euclidean ones removes the reported gains on tasks involving scene containment, or disabling evidence-priority experts increases hallucination rates in deeper layers.
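One way such an ablation could be scored is a paired comparison across matched training seeds. The sketch below runs an exact sign-flip permutation test on paired score differences. The scores are placeholder numbers invented for illustration, not results from the paper, and the test itself is a suggestion rather than the authors' protocol.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Returns (mean_difference, p_value). Under the null hypothesis that the two
    variants are interchangeable, each paired difference is equally likely to
    carry either sign, so every sign pattern is enumerated.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = mean([s * d for s, d in zip(signs, diffs)])
        total += 1
        if abs(flipped) >= observed - 1e-12:  # tolerance for float round-off
            extreme += 1
    return mean(diffs), extreme / total


# PLACEHOLDER scores for five matched seeds; invented, not from the paper.
PLACEHOLDER_HYPERBOLIC = [62.0, 61.5, 62.4, 61.9, 62.2]
PLACEHOLDER_EUCLIDEAN = [61.0, 61.2, 61.1, 61.4, 60.9]

gap, p_value = paired_sign_flip_test(PLACEHOLDER_HYPERBOLIC, PLACEHOLDER_EUCLIDEAN)
print(f"mean gap = {gap:.2f}, p = {p_value:.4f}")
```

With n paired runs the smallest attainable two-sided p-value is 2 / 2^n, so a handful of seeds cannot establish a small effect with much confidence. That is part of why the number of runs matters for the ablation described above.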

Figures

Figures reproduced from arXiv: 2606.00275 by Dandan Zhu, Hangxiangpan Wang, Heng Zhang, Huishen Jiao, Yi Zhao, Zijie Zhou.

**Figure 1.** Motivation for AsyMoE. (a) Cross-modal Association Limitations. Euclidean FFN space with flat geometry limits hierarchical semantic modeling. Text queries describe partial aspects of visual scenes, forming natural containment relationships. (b) Memory Priority Shift. Attention analysis on Qwen2.5-VL-7B shows language experts shift from evidence-based reasoning to parametric memory dependence in deeper lay…

**Figure 2.** Overall framework of AsyMoE: (a) Modality-specific experts in Euclidean space with distorted cross-modal collabora…

**Figure 3.** Training loss and accuracy score of AsyMoE in…

**Figure 4.** Performance scaling across visual instruction tuning data scales. AsyMoE demonstrates superior scaling efficiency…

**Figure 5.** Expert Activation Distribution Across Tasks. We visualize the dynamic distribution of activated experts across various…

**Figure 6.** Layer-wise Attention Gain

Original abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. This asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to parametric memory dependence, losing grounding in the provided visual and linguistic information. To address these issues, we propose AsyMoE, a novel architecture that explicitly models this asymmetry through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. Extensive experiments demonstrate that AsyMoE achieves consistent improvements over baseline methods, with average gains of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks. AsyMoE activates 25.45% fewer parameters compared to dense models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AsyMoE pairs hyperbolic inter-modality experts with evidence-priority language experts to address vision-text asymmetry in LVLMs, reporting modest gains but with limited experimental detail visible.

The main takeaway is that the paper presents AsyMoE as an asymmetric MoE variant for LVLMs. It splits experts into intra-modality, hyperbolic inter-modality for hierarchical containment, and evidence-priority language experts meant to reduce drift to parametric memory. The abstract claims 1.5% average improvement over other MoE setups, up to 3.8% on hallucination tasks, and 25.45% fewer activated parameters than dense models.

What is actually new is the explicit split that treats the modalities differently, using negative curvature to encode text-as-partial-description-of-vision and a mechanism to keep language experts tied to input evidence. Prior symmetric MoE work is cited as the baseline, so the combination looks like a targeted extension rather than a complete reinvention.

The paper does a clean job stating the two motivating problems and mapping each expert group to one of them. The efficiency angle and hallucination focus line up with current deployment needs.

The soft spots are the missing experimental controls. The abstract gives performance numbers but no setup, no ablation results, no statistical tests, and no comparison details. That makes it hard to tell whether the gains come from the hyperbolic or evidence-priority pieces or from other training choices. The core premises about hierarchical containment and progressive loss of grounding are plausible but rest on the reader accepting them without shown supporting analysis.

This is for groups already working on MoE scaling or hallucination fixes in multimodal models. A reader who wants concrete architecture tweaks for efficiency might pull useful ideas from the expert design.

It is worth sending for peer review so the full methods and tables can be checked.

Referee Report

1 major / 0 minor

Summary. The manuscript presents AsyMoE, a novel MoE architecture for large vision-language models that uses three expert groups: intra-modality experts, hyperbolic inter-modality experts to capture hierarchical cross-modal relationships, and evidence-priority language experts to maintain contextual grounding. It claims consistent improvements of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks, while activating 25.45% fewer parameters than dense models.

Significance. The proposed architecture addresses a potentially important asymmetry in how visual and linguistic modalities are processed in LVLMs. The use of hyperbolic geometry for containment structures and mechanisms to reduce parametric memory dependence could lead to more efficient and less hallucinatory multimodal models if the results are substantiated. The reported parameter savings are a notable strength.

major comments (1)

[Abstract] The abstract states performance numbers (1.5% average gain, 3.8% on hallucination tasks) but supplies no information on experimental setup, baselines, statistical tests, or ablation controls, making it impossible to assess whether the data support the claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of AsyMoE's potential to address modality asymmetry. We address the single major comment point by point below.

Referee: [Abstract] The abstract states performance numbers (1.5% average gain, 3.8% on hallucination tasks) but supplies no information on experimental setup, baselines, statistical tests, or ablation controls, making it impossible to assess whether the data support the claims.

Authors: We acknowledge that the abstract, due to length constraints, focuses on high-level claims without experimental details. The full manuscript provides these in Section 4 (experimental setup with baselines including standard MoE variants and dense models, datasets, and evaluation protocols) and Section 5 (ablation studies and statistical significance via multiple runs with variance). To address the concern, we will revise the abstract to include a concise reference to the evaluation framework and key controls.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The abstract presents a motivation for asymmetric MoE handling of vision-language modalities and reports empirical gains, but contains no equations, fitted parameters, derivations, or self-citations. No derivation chain exists that could reduce to inputs by construction. The full manuscript is referenced externally but the provided text shows a self-contained empirical proposal without load-bearing mathematical reductions or imported uniqueness claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no equations, training details, or parameter counts beyond the headline 25.45% figure are available, so the ledger is necessarily incomplete.


Publication venue (from the paper's running header): ICMR '26, June 16–19, 2026, Amsterdam, Netherlands.

[1] [1]

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Jun- yang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision- Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv:2308.12966 [cs.CV] https://arxiv.org/abs/2308.12966

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Jun Bai, Minghao Tong, Yang Liu, Zixia Jia, and Zilong Zheng. 2025. Under- standing and Leveraging the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLMs. InProceedings of the 2025 Conference on Empirical Meth- ods in Natural Language Processing. Association for Computational Linguistics, Suzhou, China, 21927–21942. doi:10.18653/v1/2...

work page doi:10.18653/v1/2025.emnlp-main.1114 2025

[3] [3]

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2025. ShareGPT4V: Improving Large Multi-modal Models with Better Captions. InComputer Vision – ECCV 2024, Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol (Eds.). Springer Nature Switzerland, Cham, 370–387

2025

[4] [4]

2024.LLaV A-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs

Shaoxiang Chen, Zequn Jie, and Lin Ma. 2024.LLaV A-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs. arXiv:2401.16160 [cs.CV] https://arxiv.org/abs/2401.16160

work page arXiv 2024

[5] [5]

Tianyu Chen, Xingcheng Fu, Yisen Gao, Haodong Qian, Yuecen Wei, Kun Yan, Haoyi Zhou, and Jianxin Li. 2025. Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4112–4121. doi:10.1109/CVPR52734.2025.00389

work page doi:10.1109/cvpr52734.2025.00389 2025

[6] [6]

XTuner Contributors. 2023. XTuner: A Toolkit for Efficiently Fine-tuning LLM. https://github.com/InternLM/xtuner

2023

[7] [7]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. InstructBLIP: To- wards General-purpose Vision-Language Models with Instruction Tuning. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Glober- son, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. C...

2023

[8]

DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434 [cs.CL] https://arxiv.org/abs/2405.04434

[9]

Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, a...

[10]

Xiaoran Fan, Tao Ji, Changhao Jiang, Shuo Li, Senjie Jin, Sirui Song, Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng, Ming Zhang, Caishuang Huang, Rui Zheng, Zhiheng Xi, Yuhao Zhou, Shihan Dou, Junjie Ye, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. 2024. MouSi: Poly-Visual-Expert Vision-Language Models. arXiv:24...

[11]

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39. http://jmlr.org/papers/v23/21-0998.html

[12]

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. 2024. HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vi...

[13]

Drew A. Hudson and Christopher D. Manning. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. Adaptive Mixtures of Local Experts. Neural Computation 3, 1 (1991), 79–87. doi:10.1162/neco.1991.3.1.79

[16]

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. 2016. A Diagram is Worth a Dozen Images. In Computer Vision – ECCV 2016, Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.). Springer International Publishing, Cham, 235–251.

[17]

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. 2023. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. arXiv:2212.05055 [cs.LG] https://arxiv.org/abs/2212.05055

[18]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv:2006.16668 [cs.CL] https://arxiv.org/abs/2006.16668

[19]

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. 2024. SEED-Bench: Benchmarking Multimodal Large Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 13299–13308.

[20]

Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, Chongyan Zhu, Xiaoyi Ren, Chao Li, Yifan Ye, Peng Liu, Lihuan Zhang, Hanshu Yan, Guoyin Wang, Bei Chen, and Junnan Li. 2025. Aria: An Open Multimodal Native Mixture-of-Experts Model. arXiv:2410.05993 [cs.CV] https://arxiv.org/abs/2410.05993

[21]

Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, and Longyin Wen. 2024. CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts. arXiv:2405.05949 [cs.CV] https://arxiv.org/abs/2405.05949

[22]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. 2023. Evaluating Object Hallucination in Large Vision-Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 292–305. doi:10.18653/v1/2023.emnlp-main.20

[23]

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. 2024. Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models. arXiv:2403.18814 [cs.CV] https://arxiv.org/abs/2403.18814

[24]

Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2023. Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. arXiv:2209.03430 [cs.LG] https://arxiv.org/abs/2209.03430

[25]

Bin Lin, Zhenyu Tang, Yang Ye, Jinfa Huang, Junwu Zhang, Yatian Pang, Peng Jin, Munan Ning, Jiebo Luo, and Li Yuan. 2024. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models. arXiv:2401.15947 [cs.CV] https://arxiv.org/abs/2401.15947

[26]

Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, and Armen Aghajanyan. 2024. MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts. arXiv:2407.21770 [cs.AI] https://arxiv.org/abs/2407.21770

[27]

Dongyang Liu, Renrui Zhang, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao, and Peng Gao. 2024. SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models. arXiv:2402.05935 [cs.CV] https://...

[28]

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved Baselines with Visual Instruction Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 26296–26306.

[29]

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/

[30]

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. 2025. MMBench: Is Your Multi-modal Model an All-Around Player?. In Computer Vision – ECCV 2024, Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol (Eds.). Springer Nature ...

[31]

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. 2024. DeepSeek-VL: Towards Real-World Vision-Language Understanding. arXiv:2403.05525 [cs.AI] https://arxiv.org/abs/2403.05525

[32]

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2024. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. In International Conference on Learning Representations, B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (Eds.), Vol. 2024. 23439–23554. https://proceedings.iclr.cc/paper_files/paper/2024/file/663bce02a0050c4a11f1eb8a7f1429d3-Paper-Conference.pdf

[34]

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35....

[35]

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, Dublin, ...

[36]

Minesh Mathew, Dimosthenis Karatzas, and C.V. Jawahar. 2021. DocVQA: A Dataset for VQA on Document Images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2200–2209.

ICMR ’26, June 16–19, 2026, Amsterdam, Netherlands. Zijie Zhou et al.

[37]

Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Al...

[38]

MistralAITeam. 2023. Mixtral of experts: A high quality Sparse Mixture-of-Experts. [EB/OL]. https://mistral.ai/news/mixtral-of-experts/ Accessed December 11, 2023.

[39]

Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, and Hannaneh Hajishirzi. 2...

[40]

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. arXiv:2201.05596 [cs.LG] https://arxiv.org/abs/2201.05596

[41]

Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, and Yuxiong He. 2023. Scaling Vision-Language Models with Sparse Mixture of Experts. arXiv:2303.07226 [cs.CV] https://arxiv.org/abs/2303.07226

[42]

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards VQA Models That Can Read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]

Dianyi Wang, Siyuan Wang, Zejun Li, Yikun Wang, Yitong Li, Duyu Tang, Xiaoyu Shen, Xuanjing Huang, and Zhongyu Wei. 2025. MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models. arXiv:2508.09779 (2025)

[44]

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. 2024. DeepSeek-VL2: Mixture-of-Experts ...

[45]

xAI. 2024. Grok-1. Online. https://github.com/xai-org/grok-1

[46]

Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. 2024. OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models. arXiv:2402.01739 [cs.CL] https://arxiv.org/abs/2402.01739

[47]

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. 2024. mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 13040–13051.

[48]

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2024. MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katheri...

[49]

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2024. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark ...

[50]

Sashuai Zhou, Hai Huang, and Yan Xia. 2025. Enhancing Multi-modal Models with Heterogeneous MoE Adapters for Fine-tuning. arXiv:2503.20633 [cs.LG] https://arxiv.org/abs/2503.20633

[51]

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2024. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. In International Conference on Learning Representations, B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (Eds.), Vol. 2024. 18378–18394. https://proceedings.iclr.cc/paper_files/paper/2024/file/50623630a2372839c078474efa6c0cb8-Paper-Conference.pdf

[53]

Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. 2022. ST-MoE: Designing Stable and Transferable Sparse Expert Models. arXiv:2202.08906 [cs.CL] https://arxiv.org/abs/2202.08906