pith. machine review for the scientific record.

arxiv: 2604.03342 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: no theorem link

Mixture-of-Experts in Remote Sensing: A Survey


Pith reviewed 2026-05-13 20:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords mixture of experts · remote sensing · survey · dynamic routing · earth observation · model specialization · spatiotemporal data

The pith

Mixture-of-Experts routes remote sensing inputs to specialized experts to manage sensor diversity and dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper provides the first systematic overview of Mixture-of-Experts applications in remote sensing, reviewing core principles, architectural variants, and uses across tasks. It highlights how MoE addresses the challenges of multimodal inputs and changing Earth conditions through selective activation of experts rather than uniform processing. A sympathetic reader would value this because remote sensing data spans optical, radar, and other modalities with high spatiotemporal variability, where dynamic routing can improve both accuracy and efficiency. The survey closes by mapping trends that point toward broader adoption in data-heavy observation workflows.

Core claim

Mixture-of-Experts models address remote sensing challenges by employing a routing mechanism that directs each input to the most relevant subset of specialized expert sub-networks, and the survey synthesizes existing designs and applications to demonstrate this approach across classification, segmentation, detection, and change-analysis tasks.

What carries the argument

The Mixture-of-Experts (MoE) model, which uses a gating or routing network to activate only a sparse subset of specialized expert networks for each input.
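The gating-plus-experts mechanism can be sketched in a few lines. This is a generic, illustrative top-k routed MoE forward pass, not code from the surveyed papers; the dimensions, the single-linear-map experts, and the renormalization over the selected experts are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes -- hypothetical, chosen only for illustration.
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is a single linear map here; real experts are MLPs.
expert_weights = rng.standard_normal((n_experts, d_model, d_model)) * 0.1
gate_weights = rng.standard_normal((d_model, n_experts)) * 0.1

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x):
    """Route each token to its top-k experts and mix their outputs."""
    probs = softmax(x @ gate_weights)             # (n_tokens, n_experts)
    top = np.argsort(probs, axis=-1)[:, -top_k:]  # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, top[t]]
        gates = gates / gates.sum()               # renormalize over the top-k
        for g, e in zip(gates, top[t]):
            out[t] += g * (x[t] @ expert_weights[e])
    return out

tokens = rng.standard_normal((5, d_model))
y = moe_forward(tokens)
print(y.shape)  # (5, 8): only 2 of the 4 experts ran for each token
```

Per token, only `top_k` expert matrices are multiplied, which is the sparse-activation property the survey credits for handling diverse sensor inputs efficiently.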

If this is right

  • Sparse activation in MoE variants lowers compute demands for processing high-volume satellite imagery.
  • Routing strategies can be tuned separately for optical, SAR, and hyperspectral data modalities.
  • Applications extend to time-series analysis and multimodal fusion without requiring full model activation.
  • Future designs may incorporate MoE layers into larger foundation models for Earth observation.
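The compute claim in the first bullet can be made concrete with back-of-envelope arithmetic. A hedged sketch, not figures from the paper: the layer widths and expert counts below are hypothetical, and real savings also depend on routing overhead and load balancing.

```python
# Per-token FLOPs for one feed-forward sublayer, comparing a dense layer that
# carries all capacity versus an MoE layer activating top_k of n_experts.
# All sizes are hypothetical.
d_model, d_ff, n_experts, top_k = 1024, 4096, 8, 2

dense_flops = 2 * d_model * d_ff * n_experts  # every token touches all capacity
moe_flops = 2 * d_model * d_ff * top_k        # same total capacity, sparse use

print(moe_flops / dense_flops)  # 0.25 -> 4x fewer FLOPs per token
```

The ratio is simply top_k / n_experts, which is why MoE scales total parameter count without scaling per-token compute.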

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • MoE routing patterns could transfer to real-time stream processing of incoming satellite feeds.
  • The surveyed techniques suggest a path for parameter-efficient adaptation across multiple remote sensing sensors.
  • Neighboring domains such as climate data analysis may adopt similar expert specialization for regional variability.

Load-bearing premise

The published body of MoE work in remote sensing is large enough and distinct enough to support a complete, representative survey.

What would settle it

Discovery of several major MoE-based remote sensing methods or papers that the survey omits or fails to synthesize.

Figures

Figures reproduced from arXiv: 2604.03342 by Lajiao Chen, Peng Liu, Yongchuan Cui.

Figure 1. Word cloud of the most frequent words appearing in MoE-related remote sensing papers.
Figure 2. Basic architecture of Mixture-of-Experts (MoE).
Figure 3. Overview of Mixture-of-Experts applications in remote sensing.
Figure 4. MoE in adaptive mixture-of-experts distillation (AMoED) [30] for cross-satellite generalizable incremental scene classification.
Figure 5. MoE in mixture-of-spectral-spatial-experts state space model (MambaMoE) [148] for hyperspectral image classification.
Figure 6. Grid-level MoE backbone used in single model for multi-modal datasets and multi-task object detection (SM3Det) [77] for multi-modal remote sensing object detection.
Figure 7. Sparse mixture-of-experts for hyperspectral object tracking (HotMoE) [132] framework.
Figure 8. Modality-aware pruning of experts (MAPEX) [43] for multi-modal remote sensing foundation models.
Figure 9. Phytoplankton absorption mixture-of-experts (PhA-MoE) [141] architecture for hyperspectral retrieval of phytoplankton absorption coefficients.
read the original abstract

Remote sensing data analysis and interpretation present unique challenges due to the diversity in sensor modalities and spatiotemporal dynamics of Earth observation data. Mixture-of-Experts (MoE) model has emerged as a powerful paradigm that addresses these challenges by dynamically routing inputs to specialized experts designed for different aspects of a task. However, despite rapid progress, the community still lacks a comprehensive review of MoE for remote sensing. This survey provides the first systematic overview of MoE applications in remote sensing, covering fundamental principles, architectural designs, and key applications across a variety of remote sensing tasks. The survey also outlines future trends to inspire further research and innovation in applying MoE to remote sensing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to deliver the first systematic survey of Mixture-of-Experts (MoE) models applied to remote sensing, covering fundamental principles, architectural designs, and applications across diverse remote sensing tasks such as image classification, segmentation, and change detection, while also discussing future trends.

Significance. If the survey is comprehensive and representative, it would fill an important gap by synthesizing MoE techniques tailored to remote sensing challenges including multi-modal sensor data and spatiotemporal variability, potentially serving as a reference for researchers bridging computer vision and Earth observation.

major comments (2)
  1. [Abstract and Introduction] The central claim that this is the 'first systematic overview' is load-bearing but unsupported by any description of the literature search protocol, including databases searched (e.g., IEEE Xplore, Google Scholar), search terms, date range, or inclusion/exclusion criteria. Without this information, it is impossible to evaluate completeness or selection bias.
  2. [Literature Overview] Section 2 or 3: The manuscript should include a quantitative summary (e.g., a table or figure) of the number of relevant peer-reviewed papers identified per task category to demonstrate that the MoE-remote sensing corpus is large and distinct enough from generic MoE vision literature to justify a dedicated survey.
minor comments (2)
  1. [Figures] Figure captions and notation: Ensure consistent use of symbols for routing gates and expert outputs across all architectural diagrams to improve readability.
  2. [References] References: Verify that all cited works are from peer-reviewed venues and that recent 2023-2024 publications are included where relevant.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the transparency and justification of our survey. We will revise the manuscript accordingly to address both major points.

read point-by-point responses
  1. Referee: [Abstract and Introduction] The central claim that this is the 'first systematic overview' is load-bearing but unsupported by any description of the literature search protocol, including databases searched (e.g., IEEE Xplore, Google Scholar), search terms, date range, or inclusion/exclusion criteria. Without this information, it is impossible to evaluate completeness or selection bias.

    Authors: We agree that a detailed description of the literature search protocol is necessary to support the claim of providing the first systematic overview. In the revised manuscript, we will add a dedicated subsection (likely in the Introduction) that explicitly describes the databases searched (IEEE Xplore, Google Scholar, arXiv, Web of Science), the search terms and Boolean combinations used (e.g., 'Mixture-of-Experts' OR MoE AND 'remote sensing' OR 'Earth observation'), the date range covered, and the inclusion/exclusion criteria applied to select papers. This will allow readers to assess completeness and potential selection bias. Revision: yes.

  2. Referee: [Literature Overview] Section 2 or 3: The manuscript should include a quantitative summary (e.g., a table or figure) of the number of relevant peer-reviewed papers identified per task category to demonstrate that the MoE-remote sensing corpus is large and distinct enough from generic MoE vision literature to justify a dedicated survey.

    Authors: We accept this recommendation. In the revised version, we will insert a new table (or figure) in the Literature Overview section that provides a quantitative breakdown of the number of peer-reviewed papers identified per remote sensing task category (e.g., classification, segmentation, change detection, object detection, and multimodal fusion). The table will also note the proportion of papers that focus specifically on remote sensing challenges versus generic vision applications, thereby demonstrating the size and distinctiveness of the MoE-remote sensing corpus. Revision: yes.

Circularity Check

0 steps flagged

No circularity: survey asserts literature gap without derivations or self-referential reductions

full rationale

The paper is a literature survey containing no equations, predictions, fitted parameters, or first-principles derivations. Its central claim—that it supplies the 'first systematic overview' of MoE in remote sensing—is a descriptive assertion about the external corpus rather than a result obtained by reducing any quantity to its own inputs. No self-citation chains, ansatzes, or uniqueness theorems are invoked to justify internal results. The absence of prior reviews is stated without reference to the authors' own prior work as load-bearing evidence. This matches the default expectation for non-circular survey papers whose claims rest on external literature rather than internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the central claim rests on the completeness and representativeness of the reviewed literature rather than on any mathematical derivations, fitted parameters, or newly postulated entities. No free parameters, axioms, or invented entities are introduced in the provided abstract.

pith-pipeline@v0.9.0 · 5402 in / 1141 out tokens · 47178 ms · 2026-05-13T20:01:12.639387+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

162 extracted references · 162 canonical work pages · 1 internal anchor

  1. [1]

    Aggarwal, V., Nagarajan, K., and Slatton, K. C. (2004). Multiple-model multiscale data fusion regulated by a mixture-of-experts network. In IGARSS 2004. 2004 IEEE International Geoscience and Remote Sensing Symposium, volume 1. IEEE

  2. [2]

    Albughdadi, M. (2025). Lightweight metadata-aware mixture-of-experts masked autoencoder for earth observation

  3. [3]

    Bi, H., Feng, Y., Tong, B., Wang, M., Yu, H., Mao, Y., Chang, H., Diao, W., Wang, P., Yu, Y., Peng, H., Zhang, Y., Fu, K., and Sun, X. (2025). RingMoE: Mixture-of-modality-experts multi-modal foundation models for universal remote sensing image interpretation

  4. [4]

    Cai, W., Jiang, J., Wang, F., Tang, J., Kim, S., and Huang, J. (2024). A survey on mixture of experts. arXiv preprint

  5. [5]

    Cai, W., Jiang, J., Wang, F., Tang, J., Kim, S., and Huang, J. (2025). A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering, 37(7):3896–3915

  6. [6]

    Chai, B., Zhou, Q., Nie, X., Qiao, Q., Wu, W., Shi, Y., and Li, X. (2025). Scalable mixture-of-experts attention feature pyramid network for detection and segmentation

  7. [7]

    Chamroukhi, F. (2017). Skew t mixture of experts. Neurocomputing, 266:390–408

  8. [8]

    Chen, B., Chen, K., Yang, M., Zou, Z., and Shi, Z. (2025a). Heterogeneous mixture of experts for remote sensing image super-resolution. IEEE Geoscience and Remote Sensing Letters, 22:1–5

  9. [9]

    Chen, T., Zhang, Z., Jaiswal, A. K., Liu, S., and Wang, Z. (2023a). Sparse MoE as the new dropout: Scaling dense and self-slimmable transformers. In The Eleventh International Conference on Learning Representations

  10. [10]

    Chen, X., Yan, S., Zhu, J., Chen, C., Liu, Y., and Zhang, M. (2025b). Generalizable multispectral land cover classification via frequency-aware mixture of low-rank token experts

  11. [11]

    Chen, Y., Cui, H., Zhang, G., Li, X., Xie, Z., Li, H., and Li, D. (2025c). SparseFormer: A credible dual-cnn expert-guided transformer for remote sensing image segmentation with sparse point annotation. IEEE Transactions on Geoscience and Remote Sensing, 63:1–16

  12. [12]

    Chen, Y., Jiang, W., and Wang, Y. (2025d). FAMHE-Net: Multi-scale feature augmentation and mixture of heterogeneous experts for oriented object detection. Remote Sensing, 17(2):205

  13. [13]

    Chen, Z., Deng, Y., Wu, Y., Gu, Q., and Li, Y. (2022). Towards understanding the mixture-of-experts layer in deep learning. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems, volume 35, pages 23049–23062. Curran Associates, Inc

  14. [14]

    Chen, Z., Shen, Y., Ding, M., Chen, Z., Zhao, H., Learned-Miller, E., and Gan, C. (2023b). Mod-Squad: Designing mixtures of experts as modular multi-task learners. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11828–11837

  15. [15]

    Cheng, G., Han, J., and Lu, X. (2017). Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883

  16. [16]

    Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., Xie, Z., Li, Y., Huang, P., Luo, F., Ruan, C., Sui, Z., and Liang, W. (2024). DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. In Ku, L.-W., Martins, A., and Srikumar, V., editors, Proceedings of the 62nd Annual Meeting of th...

  17. [17]

    Dai, X., Li, Z., Li, L., Xue, S., Huang, X., and Yang, X. (2025). HyperTransXNet: learning both global and local dynamics with a dual dynamic token mixer for hyperspectral image classification. Remote Sensing, 17(14):2361

  18. [18]

    Dao, T. and Gu, A. (2024). Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning (ICML)

  19. [19]

    Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res., 7:1–30

  20. [20]

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T., editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human...

  21. [21]

    Dimitri, V., Regina, B., and Alfonz, M. (2025). A survey on mixture of experts: Advancements, challenges, and future directions. TechRxiv Preprints

  22. [22]

    Ding, L., Hong, D., Zhao, M., Chen, H., Li, C., Deng, J., Yokoya, N., Bruzzone, L., and Chanussot, J. (2025). A survey of sample-efficient deep learning for change detection in remote sensing: Tasks, strategies, and challenges. IEEE Geoscience and Remote Sensing Magazine, 13(3):164–189

  23. [23]

    Do, G., Le, H., and Tran, T. (2025). SimSMoE: Toward efficient training mixture of experts via solving representational collapse. In Chiruzzo, L., Ritter, A., and Wang, L., editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 2012–2025, Albuquerque, New Mexico. Association for Computational Linguistics

  24. [24]

    Dong, Z., Sun, Y., Jiang, H., Liu, T., and Gu, Y. (2025). PhyDAE: Physics-guided degradation-adaptive experts for all-in-one remote sensing image restoration

  25. [25]

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR)

  26. [26]

    Dou, P., Shen, H., Li, Z., and Guan, X. (2021). Time series remote sensing image classification framework using combination of deep learning and multiple classifiers system. International Journal of Applied Earth Observation and Geoinformation, 103:102477

  27. [27]

    Dror, R., Baumer, G., Shlomov, S., and Reichart, R. (2018). The hitchhiker’s guide to testing statistical significance in natural language processing. In Gurevych, I. and Miyao, Y., editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392, Melbourne, Australia. Association for Compu...

  28. [28]

    Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., Zoph, B., Fedus, L., Bosma, M. P., Zhou, Z., Wang, T., Wang, E., Webster, K., Pellat, M., Robinson, K., Meier-Hellstern, K., Duke, T., Dixon, L., Zhang, K., Le, Q., Wu, Y., Chen, Z., and Cui, C. (2022). GLaM: Efficient scaling of language models with...

  29. [29]

    Fedus, W., Zoph, B., and Shazeer, N. (2022). Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39

  30. [30]

    Fu, Y., Yang, R., Liu, Z., and Ng, M. K. (2025). Adaptive mixture-of-experts distillation for cross-satellite generalizable incremental remote sensing scene classification. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1

  31. [31]

    Fung, T. C. and Tseung, S. C. (2025). Mixture of experts models for multilevel data: Modeling framework and approximation theory. Neurocomputing, 626:129357

  32. [32]

    Gale, T., Elsen, E., and Hooker, S. (2023). MegaBlocks: Efficient sparse training with mixture-of-experts. arXiv preprint

  33. [33]

    Gan, W., Ning, Z., Qi, Z., and Yu, P. S. (2025). Mixture of experts (MoE): A big data perspective. arXiv preprint

  34. [34]

    Gao, J., Li, P., Chen, Z., and Zhang, J. (2020). A survey on deep learning for multimodal data fusion. Neural Computation, 32(5):829–864

  35. [35]

    Gao, Q., Qu, J., Li, Y., and Dong, W. (2025a). Rethinking efficient mixture-of-experts for remote sensing modality-missing classification

  36. [36]

    Gao, S., Hua, T., Shirkavand, R., Lin, C.-H., Tang, Z., Li, Z., Yuan, L., Li, F., Zhang, Z., Ganjdanesh, A., Qian, L., Jie, X., and Hsu, Y.-C. (2025b). ToMoE: Converting dense large language models to mixture-of-experts through dynamic structural pruning

  37. [37]

    Gu, A. and Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752

  38. [38]

    Gu, N., Zhang, Z., Feng, Y., Chen, Y., Fu, P., Lin, Z., Wang, S., Sun, Y., Wu, H., Wang, W., and Wang, H. (2025). Elastic MoE: Unlocking the inference-time scalability of mixture-of-experts

  39. [39]

    Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 1321–1330. JMLR.org

  40. [40]

    Guo, S., Chen, T., Wang, P., Yan, J., and Liu, H. (2025). Confidence fusion with representation distribution and mixture of experts for multimodal radar target recognition. IEEE Transactions on Aerospace and Electronic Systems, 61(5):13251–13268

  41. [41]

    Gupta, S., Mukherjee, S., Subudhi, K., Gonzalez, E., Jose, D., Awadallah, A. H., and Gao, J. (2022). Sparsely activated mixture-of-experts are robust multi-task learners. arXiv preprint

  42. [42]

    Gururangan, S., Li, M., Lewis, M., Shi, W., Althoff, T., Smith, N. A., and Zettlemoyer, L. (2023). Scaling expert language models with unsupervised domain discovery

  43. [43]

    Hanna, J., Scheibenreif, L., and Borth, D. (2025). MAPEX: Modality-aware pruning of experts for remote sensing foundation models

  44. [44]

    Hazimeh, H., Zhao, Z., Chowdhery, A., Sathiamoorthy, M., Chen, Y., Mazumder, R., Hong, L., and Chi, E. (2021). DSelect-k: Differentiable selection in the mixture of experts with applications to multi-task learning. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W., editors, Advances in Neural ...

  45. [45]

    He, J., Qiu, J., Zeng, A., Yang, Z., Zhai, J., and Tang, J. (2021). FastMoE: A fast mixture-of-expert training system. arXiv preprint

  46. [46]

    He, J., Zhai, J., Antunes, T., Wang, H., Luo, F., Shi, S., and Li, Q. (2022). FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 120–134

  47. [47]

    He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778

  48. [48]

    He, S., Cheng, Q., Huai, Y., Zhu, Z., and Ding, J. (2024a). Mixture-of-experts for semantic segmentation of remoting sensing image. In Qin, C. and Zhou, H., editors, International Conference on Image Processing and Artificial Intelligence (ICIPAl 2024), volume 13213, page 132131Z. International Society for Optics and Photonics, SPIE

  49. [49]

    He, W., Cai, Y., Ren, Q., Ruze, A., and Jia, S. (2025). Adaptive expert learning for hyperspectral and multispectral image fusion. IEEE Transactions on Geoscience and Remote Sensing, 63:1–15

  50. [50]

    He, X., Yan, K., Li, R., Xie, C., Zhang, J., and Zhou, M. (2024b). Frequency-adaptive pan-sharpening with mixture of experts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 2121–2129

  51. [51]

    Ho, N., Yang, C.-Y., and Jordan, M. I. (2022). Convergence rates for gaussian mixtures of experts. Journal of Machine Learning Research, 23(323):1–81

  52. [52]

    Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780

  53. [53]

    Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations

  54. [54]

    Huang, Q., An, Z., Zhuang, N., Tao, M., Zhang, C., Jin, Y., Xu, K., Xu, K., Chen, L., Huang, S., and Feng, Y. (2024). Harder task needs more experts: Dynamic routing in MoE models. In Ku, L.-W., Martins, A., and Srikumar, V., editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12883–12895, Bang...

  55. [55]

    Hwang, C., Cui, W., Xiong, Y., Yang, Z., Liu, Z., Hu, H., Wang, Z., Salas, R., Jose, J., Ram, P., Chau, H., Cheng, P., Yang, F., Yang, M., and Xiong, Y. (2023). Tutel: Adaptive mixture-of-experts at scale. In Song, D., Carbin, M., and Chen, T., editors, Proceedings of Machine Learning and Systems, volume 5, pages 269–287. Curran

  56. [56]

    Hwang, R., Wei, J., Cao, S., Hwang, C., Tang, X., Cao, T., and Yang, M. (2024). Pre-gated MoE: An algorithm-system co-design for fast and scalable mixture-of-expert inference. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 1018–1031

  57. [57]

    Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1):79–87

  58. [58]

    Jawahar, G., Mukherjee, S., Liu, X., Kim, Y. J., Abdul-Mageed, M., Lakshmanan, V. S. L., Awadallah, A. H., Bubeck, S., and Gao, J. (2023). AutoMoE: Heterogeneous mixture-of-experts with adaptive computation for efficient neural machine translation. In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors, Findings of the Association for Computational Lingu...

  59. [59]

    Jia, Y., Ge, Y., Ling, F., Guo, X., Wang, J., Wang, L., Chen, Y., and Li, X. (2018). Urban land use mapping by combining remote sensing imagery and mobile phone positioning data. Remote Sensing, 10(3)

  60. [60]

    Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Bou Hanna, E., Bressand, F., et al. (2024). Mixtral of experts. arXiv preprint

  61. [61]

    Jiang, C., Osei, K., Yeddula, S. D., Feng, D., and Ku, W.-S. (2025). Knowledge-guided adaptive mixture of experts for precipitation prediction

  62. [62]

    Jiang, H., Peng, M., Zhong, Y., Xie, H., Hao, Z., Lin, J., Ma, X., and Hu, X. (2022). A survey on deep learning-based change detection from high-resolution remote sensing images. Remote Sensing, 14(7)

  63. [63]

    Jiang, W. and Tanner, M. A. (1999). Hierarchical mixtures-of-experts for generalized linear models: Some results on denseness and consistency. In Heckerman, D. and Whittaker, J., editors, Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics, volume R2 of Proceedings of Machine Learning Research. PMLR. Reissued by PMLR ...

  64. [64]

    Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214

  65. [65]

    Komatsuzaki, A., Puigcerver, J., Lee-Thorp, J., Ruiz, C. R., Mustafa, B., Ainslie, J., Tay, Y., Dehghani, M., and Houlsby, N. (2023). Sparse Upcycling: Training mixture-of-experts from dense checkpoints. In The Eleventh International Conference on Learning Representations

  66. [66]

    Kong, Y., Yu, S., Cheng, Y., Philip Chen, C. L., and Wang, X. (2025). Joint classification of hyperspectral images and lidar data based on candidate pseudo labels pruning and dual mixture of experts. IEEE Transactions on Geoscience and Remote Sensing, 63:1–12

  67. [67]

    Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, volume 25

  68. [68]

    Kudugunta, S., Huang, Y., Bapna, A., Krikun, M., Lepikhin, D., Luong, M.-T., and Firat, O. (2021). Beyond Distillation: Task-level mixture-of-experts for efficient inference. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t., editors, Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3577–3599, Punta Cana, Dominican Republic....

  69. [69]

    Kunwar, P., Vu, M. N., Gupta, M., Abdelsalam, M., and Bhattarai, M. (2025). TT-LoRA MoE: Unifying parameter-efficient fine-tuning and sparse mixture-of-experts. arXiv preprint arXiv:2504.21190

  70. [70]

    Kussul, N., Shelestov, A., Lavreniuk, M., Butko, I., and Skakun, S. (2016). Deep learning approach for large scale land cover mapping based on remote sensing data fusion. In 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 198–201

  71. [71]

    Lee, S., Park, S., Yang, J., Kim, J., and Cha, M. (2025). Generalizable slum detection from satellite imagery with mixture-of-experts

  72. [72]

    Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. (2021). GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations

  73. [73]

    Lewis, M., Bhosale, S., Dettmers, T., Goyal, N., and Zettlemoyer, L. (2021). BASE layers: Simplifying training of large, sparse models. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6265–6274. PMLR

  74. [74]

    Li, J., Kang, J., Lu, J., Fu, H., Li, Z., Liu, B., Lin, X., Zhao, J., Guan, H., Liu, H., and Liu, Z. (2025a). Dynamic gating-enhanced deep learning model with multi-source remote sensing synergy for optimizing wheat yield estimation. Frontiers in Plant Science, Volume 16 - 2025

  75. [75]

    Li, J., Li, D., Savarese, S., and Hoi, S. (2023). BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org

  76. [76]

    Li, R., Ding, X., Peng, S., and Cai, F. (2025b). U-MoEMamba: A hybrid expert segmentation model for cabbage heads in complex uav low-altitude remote sensing scenarios. Agriculture, 15(16):1723

  77. [77]

    Li, Y., Li, X., Li, Y., Zhang, Y., Dai, Y., Hou, Q., Cheng, M.-M., and Yang, J. (2025c). SM3Det: A unified model for multi-modal remote sensing object detection

  78. [78]

    Li, Z., Chen, X., Li, J., and Zhang, J. (2022). Pertinent multigate mixture-of-experts-based prestack three-parameter seismic inversion. IEEE Transactions on Geoscience and Remote Sensing, 60:1–15

  79. [79]

    Liang, H., Fan, Z., Sarkar, R., Jiang, Z., Chen, T., Zou, K., Cheng, Y., Hao, C., and Wang, Z. (2022). M3ViT: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA. Curran Associates Inc

  80. [80]

    Liao, M., Chen, W., Shen, J., Guo, S., and Wan, H. (2025). HMoRA: Making LLMs more effective with hierarchical mixture of LoRA experts. In The Thirteenth International Conference on Learning Representations

Showing first 80 references.