Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation
Pith reviewed 2026-05-10 13:34 UTC · model grok-4.3
The pith
Patch-wise sparse mixture-of-experts layers improve CNN semantic segmentation accuracy by up to 3.9 mIoU with low added computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A patch-wise formulation of sparse MoE layers, where local regions are routed to a small subset of convolutional experts, can be integrated into CNN-based semantic segmentation models. When inserted into encoder-decoder and backbone architectures and tested on Cityscapes and BDD100K, these layers produce architecture-dependent improvements of up to +3.9 mIoU with only minor computational overhead, while exposing clear patterns of expert specialization and dynamic routing.
What carries the argument
A patch-wise sparse MoE layer in which local image regions are routed to a small subset of convolutional experts.
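The paper links its implementation rather than reproducing it here, so the following is only a minimal PyTorch sketch of what such a patch-wise sparse MoE layer could look like. The class and parameter names (PatchMoEConv, num_experts, top_k, patch_size) and the pooled-descriptor router are assumptions for illustration, not the authors' design.

```python
# Minimal sketch of a patch-wise sparse MoE convolution layer (illustrative,
# not the released implementation). Each non-overlapping spatial patch of the
# feature map is routed to its top-k convolutional experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchMoEConv(nn.Module):
    def __init__(self, in_ch, out_ch, num_experts=4, top_k=1, patch_size=32):
        super().__init__()
        self.top_k = top_k
        self.patch_size = patch_size
        # Convolutional experts: identical topology, independent weights.
        self.experts = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1) for _ in range(num_experts)]
        )
        # Router: one logit per expert from each patch's pooled descriptor.
        self.router = nn.Linear(in_ch, num_experts)

    def forward(self, x):
        b, c, h, w = x.shape
        p = self.patch_size
        assert h % p == 0 and w % p == 0, "feature map must be divisible by patch size"
        gh, gw = h // p, w // p

        # Per-patch descriptors -> routing logits -> sparse top-k gates.
        desc = F.adaptive_avg_pool2d(x, (gh, gw))               # (b, c, gh, gw)
        logits = self.router(desc.permute(0, 2, 3, 1))          # (b, gh, gw, E)
        topk_val, topk_idx = logits.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(logits).scatter_(-1, topk_idx, topk_val.softmax(dim=-1))

        # Dense expert evaluation for clarity; a real implementation would only
        # run each expert on the patches actually routed to it.
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)              # (b, out, h, w, E)
        gates_full = gates.repeat_interleave(p, dim=1).repeat_interleave(p, dim=2)  # (b, h, w, E)
        return (expert_out * gates_full.unsqueeze(1)).sum(dim=-1)


# Example: PatchMoEConv(128, 128, num_experts=4, top_k=1, patch_size=32)(torch.randn(2, 128, 64, 64))
```

In this form only the gating is sparse; realizing the low overhead the paper reports requires dispatching each patch to just its selected experts instead of evaluating all of them.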
If this is right
- Gains appear consistently, though their magnitude varies between encoder-decoder and backbone-only CNN designs.
- Routing dynamics and expert specialization respond strongly to architectural choices.
- The added capacity produces only small increases in computational cost.
- Similar patterns appear on both Cityscapes and BDD100K datasets.
Where Pith is reading between the lines
- The same patch-wise routing approach could be tested on other dense prediction tasks such as instance segmentation or monocular depth estimation.
- High design sensitivity suggests that automated architecture search might locate even better MoE configurations.
- If expert specialization generalizes, the method could help adapt segmentation models to new domains with limited retraining.
Load-bearing premise
The reported mIoU gains are produced by the patch-wise MoE routing and resulting expert specialization rather than by other training details, hyperparameter choices, or dataset-specific effects.
What would settle it
Retraining the identical baseline CNN architectures with the same training protocol and hyperparameters but without any MoE layers and observing whether the mIoU difference disappears or reverses.
Original abstract
Sparse mixture-of-experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace feed-forward network blocks. In contrast, integrating sparse MoE layers into convolutional neural networks (CNNs) remains inconsistent, with most prior work focusing on fine-grained MoEs operating at the filter or channel levels. In this work, we investigate a coarser, patch-wise formulation of sparse MoE layers for semantic segmentation, where local regions are routed to a small subset of convolutional experts. Through experiments on the Cityscapes and BDD100K datasets using encoder-decoder and backbone-based CNNs, we conduct a design analysis to assess how architectural choices affect routing dynamics and expert specialization. Our results demonstrate consistent, architecture-dependent improvements (up to +3.9 mIoU) with little computational overhead, while revealing strong design sensitivity. Our work provides empirical insights into the design and internal dynamics of sparse MoE layers in CNN-based dense prediction. Our code is available at https://github.com/KASTEL-MobilityLab/moe-layers/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that integrating a coarser patch-wise sparse MoE formulation into CNNs for semantic segmentation yields architecture-dependent mIoU gains (up to +3.9) on Cityscapes and BDD100K with low computational overhead. It contrasts this with prior fine-grained MoE work, reports experiments across encoder-decoder and backbone CNNs, and provides a design analysis of routing dynamics and expert specialization, concluding that performance is highly sensitive to architectural choices.
Significance. If the causal attribution holds, the work supplies empirical guidance on adapting MoE to CNN-based dense prediction, an area less mature than transformer MoE. The design-sensitivity findings and public code could help practitioners implement efficient segmentation models and avoid common pitfalls in routing and expert configuration.
major comments (3)
- [§4 Experimental Setup and Results]: The manuscript does not state whether MoE-augmented models and baselines were trained under identical hyperparameter regimes, optimizer choices, or total training iterations. This is load-bearing for the central claim, because any incidental training differences could produce the reported mIoU deltas independently of patch-wise routing or expert specialization.
- [Results tables] (e.g., those reporting per-architecture mIoU): No capacity-matched controls (dense model with equivalent parameter count or FLOPs) or single-expert ablations are described. Without these, the +3.9 mIoU improvement and the attribution to sparse routing cannot be isolated from simple increases in effective capacity.
- [§5 Design Analysis]: The claim of “strong design sensitivity” is supported only by qualitative observations; quantitative metrics of expert utilization, load balance, or specialization (e.g., entropy of routing distributions or inter-expert feature diversity) are not reported, weakening the behavioral insights that accompany the performance numbers.
minor comments (2)
- [Abstract] The abstract states “consistent, architecture-dependent improvements” but does not list the per-architecture deltas; adding a one-sentence summary would improve clarity.
- Figure captions describing routing visualizations should explicitly state the color scale or metric used so readers can interpret expert activation patterns without referring to the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the experimental rigor and analysis.
Point-by-point responses
- Referee: [§4 Experimental Setup and Results]: The manuscript does not state whether MoE-augmented models and baselines were trained under identical hyperparameter regimes, optimizer choices, or total training iterations. This is load-bearing for the central claim, because any incidental training differences could produce the reported mIoU deltas independently of patch-wise routing or expert specialization.
Authors: We confirm that all models (MoE-augmented and baselines) were trained under identical conditions using the same optimizer (AdamW), learning rate schedule, batch size, total iterations, and data augmentations as described in the experimental protocol. This equivalence was followed throughout but not explicitly reiterated for the MoE variants. We have added a dedicated clarification paragraph in the revised §4 stating that identical hyperparameter regimes were used for all compared models. revision: yes
- Referee: [Results tables] (e.g., those reporting per-architecture mIoU): No capacity-matched controls (dense model with equivalent parameter count or FLOPs) or single-expert ablations are described. Without these, the +3.9 mIoU improvement and the attribution to sparse routing cannot be isolated from simple increases in effective capacity.
Authors: We agree that capacity-matched controls and single-expert ablations are necessary to isolate the contribution of sparse routing. In the revised manuscript we have added new experiments: (i) dense baselines whose channel widths were scaled to match the total parameter count and FLOPs of the corresponding MoE models, and (ii) single-expert ablations in which routing is replaced by a fixed assignment to one expert. The updated results tables show that the reported mIoU gains remain after these controls, supporting attribution to the patch-wise sparse mechanism. revision: yes
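To make the capacity-matching idea concrete, here is a small back-of-the-envelope calculation, assumed rather than taken from the paper, of how much wider a dense 3x3 convolution would have to be to match the parameter count of an E-expert MoE layer; helper names such as width_multiplier_to_match are purely illustrative.

```python
# Illustrative parameter-matching check for a capacity-matched dense control.
# FLOPs-matched controls would instead count only the top-k active experts.
def conv_params(in_ch, out_ch, k=3):
    return in_ch * out_ch * k * k + out_ch  # weights + bias


def moe_layer_params(in_ch, out_ch, num_experts, k=3):
    experts = num_experts * conv_params(in_ch, out_ch, k)
    router = in_ch * num_experts + num_experts  # linear gate on pooled features
    return experts + router


def width_multiplier_to_match(in_ch, out_ch, num_experts, k=3):
    """Smallest width factor w such that a dense conv with w*out_ch output
    channels has at least as many parameters as the MoE layer."""
    target = moe_layer_params(in_ch, out_ch, num_experts, k)
    w = 1.0
    while conv_params(in_ch, int(round(w * out_ch)), k) < target:
        w += 0.05
    return round(w, 2)


if __name__ == "__main__":
    # e.g. 128->128 channels, 4 experts: the dense control needs roughly 4x the width.
    print(width_multiplier_to_match(128, 128, num_experts=4))
```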
- Referee: [§5 Design Analysis]: The claim of “strong design sensitivity” is supported only by qualitative observations; quantitative metrics of expert utilization, load balance, or specialization (e.g., entropy of routing distributions or inter-expert feature diversity) are not reported, weakening the behavioral insights that accompany the performance numbers.
Authors: We acknowledge that the original §5 relied on qualitative routing visualizations. To provide quantitative support we have added the following metrics to the revised design analysis: entropy of the per-patch routing distributions, coefficient of variation for expert load balance, and average cosine similarity between expert output features as a measure of specialization. These statistics are now reported for each architectural variant and corroborate the observed design sensitivity. revision: yes
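The response names three metrics without defining them; one plausible formulation, written as a hedged PyTorch sketch with assumed tensor shapes, could look like this.

```python
# Assumed formulations of the routing/specialization metrics named above:
# entropy of per-patch routing distributions, coefficient of variation of
# expert load, and mean pairwise cosine similarity of expert output features.
import torch
import torch.nn.functional as F


def routing_entropy(gates):
    """gates: (num_patches, num_experts) routing probabilities per patch."""
    p = gates.clamp_min(1e-12)
    return (-(p * p.log()).sum(dim=-1)).mean().item()


def load_balance_cv(gates):
    """Coefficient of variation of per-expert load (mean gate mass); 0 = perfectly balanced."""
    load = gates.mean(dim=0)  # (num_experts,)
    return (load.std(unbiased=False) / load.mean()).item()


def expert_similarity(expert_features):
    """expert_features: (num_experts, feat_dim) mean output feature per expert.
    Returns average pairwise cosine similarity (lower = more specialized)."""
    z = F.normalize(expert_features, dim=-1)
    sim = z @ z.t()
    e = sim.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()
    return (off_diag / (e * (e - 1))).item()
```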
Circularity Check
No circularity: purely empirical study with no derivations or predictions
Full rationale
The paper reports experimental results on inserting patch-wise sparse MoE layers into CNN encoders/decoders for semantic segmentation on Cityscapes and BDD100K. All claims rest on measured mIoU deltas, routing statistics, and design-sensitivity observations rather than any first-principles derivation, fitted-parameter prediction, or self-referential equation. No load-bearing step reduces to its own inputs by construction; the work contains no mathematical chain that could be circular.