Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models
Pith reviewed 2026-06-28 22:41 UTC · model grok-4.3
The pith
AsyMoE models vision-language asymmetry with hyperbolic geometry and evidence-priority experts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AsyMoE is a novel architecture that explicitly models the asymmetry in vision-language processing through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. This design leads to consistent improvements over baseline methods.
What carries the argument
AsyMoE architecture with three specialized expert groups: intra-modality experts, hyperbolic inter-modality experts using negative curvature geometry, and evidence-priority language experts.
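To make "negative curvature geometry" concrete, here is a minimal sketch of the geodesic distance in a Poincaré ball. The text above does not say which hyperbolic model AsyMoE uses, so the Poincaré ball, the function name `poincare_distance`, and the curvature parameter `c` are assumptions for illustration only.

```python
import math

def poincare_distance(u, v, c=1.0):
    """Geodesic distance between u and v in a Poincare ball of curvature -c.

    Assumed model for illustration; not taken from the paper.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    scale_u = 1.0 - c * sq_norm_u
    scale_v = 1.0 - c * sq_norm_v
    if scale_u <= 0.0 or scale_v <= 0.0:
        raise ValueError("points must lie strictly inside the ball of radius 1/sqrt(c)")
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * c * diff_sq / (scale_u * scale_v)) / math.sqrt(c)

# The same Euclidean layout, scaled by 9, sits much farther apart near the boundary.
near_origin = poincare_distance([0.1, 0.0], [0.0, 0.1])
near_boundary = poincare_distance([0.9, 0.0], [0.0, 0.9])
print(near_origin, near_boundary)
```

Distances grow rapidly toward the boundary of the ball, which is why tree-like or containment structures are often embedded with general concepts near the origin and specific ones near the edge.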
If this is right
- Achieves average gains of 1.5% over MoE variants on multimodal tasks.
- Improves up to 3.8% on hallucination-sensitive tasks.
- Activates 25.45% fewer parameters compared to dense models.
- Maintains contextual grounding in language experts across all network depths.
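As a quick arithmetic check on the parameter claim, the sketch below converts "25.45% fewer activated parameters" into an activated count. The 7e9 dense size is a hypothetical placeholder, not a figure from the paper, and the function name is illustrative.

```python
def activated_params(dense_params, reduction=0.2545):
    """Parameters active per forward pass if `reduction` of the dense count is skipped."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return dense_params * (1.0 - reduction)

# Hypothetical 7e9-parameter dense baseline (placeholder, not from the paper).
print(f"{activated_params(7e9) / 1e9:.2f}B parameters active")
```

Under the stated reduction, about 74.55% of the dense parameter count remains active.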
Where Pith is reading between the lines
- The hyperbolic component could be tested on other containment-heavy tasks such as visual question answering with nested objects.
- Evidence prioritization might reduce over-reliance on training data in text-only models without multimodal input.
- The three-group split suggests a general template for handling asymmetric modalities in future multimodal architectures.
Load-bearing premise
Text and vision form hierarchical containment relationships that Euclidean expert spaces struggle to encode, and language experts in deeper layers progressively lose grounding in the provided context.
What would settle it
A controlled test where replacing the hyperbolic inter-modality experts with Euclidean ones removes the reported gains on tasks involving scene containment, or disabling evidence-priority experts increases hallucination rates in deeper layers.
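The controlled test described above amounts to comparing per-task scores of the full model against an ablated variant. The following is a rough sketch of that comparison; the function name, task names, and all scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean

def ablation_deltas(full, ablated):
    """Per-task score change (full minus ablated) and the mean change over shared tasks."""
    tasks = sorted(set(full) & set(ablated))
    if not tasks:
        raise ValueError("no shared tasks to compare")
    deltas = {task: full[task] - ablated[task] for task in tasks}
    return deltas, mean(deltas.values())

# Placeholder scores only: full model vs. a variant with Euclidean inter-modality experts.
full_scores = {"containment_task": 70.0, "hallucination_task": 80.0}
euclidean_scores = {"containment_task": 68.0, "hallucination_task": 79.0}
per_task, average = ablation_deltas(full_scores, euclidean_scores)
print(per_task, average)
```

If the hyperbolic experts carry the reported gains, the deltas on containment-heavy tasks should be clearly positive and larger than run-to-run variance; deltas near zero would count against the premise.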
Original abstract
Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. This asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to parametric memory dependence, losing grounding in the provided visual and linguistic information. To address these issues, we propose AsyMoE, a novel architecture that explicitly models this asymmetry through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. Extensive experiments demonstrate that AsyMoE achieves consistent improvements over baseline methods, with average gains of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks. AsyMoE activates 25.45% fewer parameters compared to dense models.
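The abstract does not describe how tokens are routed to the three expert groups. As a sketch of how sparse activation typically works in MoE layers, the snippet below applies generic top-k softmax gating; this is a stand-in technique, not AsyMoE's router, and the `GROUPS` layout and all names are hypothetical.

```python
import math

# Hypothetical layout: which expert indices belong to which group (not from the paper).
GROUPS = {
    "intra_modality": [0, 1],
    "hyperbolic_inter_modality": [2, 3],
    "evidence_priority_language": [4, 5],
}

def top_k_route(logits, k=2):
    """Return {expert_index: weight} for the k largest router logits.

    Weights are a softmax restricted to the selected experts, so they sum to 1.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}

def group_of(expert_index):
    """Look up the (hypothetical) group an expert index belongs to."""
    for name, members in GROUPS.items():
        if expert_index in members:
            return name
    raise KeyError(expert_index)

weights = top_k_route([0.1, 2.0, -1.0, 2.0, 0.5, 0.0], k=2)
print({group_of(i): w for i, w in weights.items()})
```

Only the selected experts run for a given token, which is the general mechanism by which an MoE model activates fewer parameters than a dense model of the same total size.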
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents AsyMoE, a novel MoE architecture for large vision-language models that uses three expert groups: intra-modality experts, hyperbolic inter-modality experts to capture hierarchical cross-modal relationships, and evidence-priority language experts to maintain contextual grounding. It claims average gains of 1.5% over MoE variants and gains of up to 3.8% on hallucination-sensitive tasks, while activating 25.45% fewer parameters than dense models.
Significance. The proposed architecture addresses a potentially important asymmetry in how visual and linguistic modalities are processed in LVLMs. The use of hyperbolic geometry for containment structures and mechanisms to reduce parametric memory dependence could lead to more efficient and less hallucinatory multimodal models if the results are substantiated. The reported parameter savings are a notable strength.
major comments (1)
- [Abstract] The abstract states performance numbers (1.5% average gain, 3.8% on hallucination tasks) but supplies no information on experimental setup, baselines, statistical tests, or ablation controls, making it impossible to assess whether the data support the claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of AsyMoE's potential to address modality asymmetry. We address the single major comment point by point below.
- Referee: [Abstract] The abstract states performance numbers (1.5% average gain, 3.8% on hallucination tasks) but supplies no information on experimental setup, baselines, statistical tests, or ablation controls, making it impossible to assess whether the data support the claims.
  Authors: We acknowledge that the abstract, due to length constraints, focuses on high-level claims without experimental details. The full manuscript provides these in Section 4 (experimental setup with baselines including standard MoE variants and dense models, datasets, and evaluation protocols) and Section 5 (ablation studies and statistical significance via multiple runs with variance). To address the concern, we will revise the abstract to include a concise reference to the evaluation framework and key controls.
  Revision: partial
Circularity Check
No significant circularity detected
The abstract presents a motivation for asymmetric MoE handling of vision-language modalities and reports empirical gains, but contains no equations, fitted parameters, derivations, or self-citations. No derivation chain exists that could reduce to inputs by construction. The full manuscript is referenced externally but the provided text shows a self-contained empirical proposal without load-bearing mathematical reductions or imported uniqueness claims.
Source paper: Zijie Zhou et al., ICMR ’26, June 16–19, 2026, Amsterdam, Netherlands.