pith. machine review for the scientific record.

arxiv: 2604.13761 · v1 · submitted 2026-04-15 · 💻 cs.CV · cs.LG

Recognition: unknown

Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation


Pith reviewed 2026-05-10 13:34 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords mixture of experts · semantic segmentation · convolutional neural networks · sparse MoE · patch-wise routing · expert specialization · dense prediction · Cityscapes

The pith

Patch-wise sparse mixture-of-experts layers improve CNN semantic segmentation accuracy by up to 3.9 mIoU with low added computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies how sparse mixture-of-experts layers can be added to convolutional networks for semantic segmentation. It replaces the usual fine-grained experts with a coarser patch-wise version that routes local image regions to a small number of convolutional expert blocks. Experiments on Cityscapes and BDD100K using both encoder-decoder and backbone CNNs show consistent but architecture-dependent gains reaching 3.9 points in mean intersection over union, while extra computation remains modest. The results also reveal that routing behavior and expert specialization depend strongly on the surrounding network design. Readers might care because the approach offers a practical way to enlarge model capacity for dense prediction without the full cost of bigger networks.

Core claim

A patch-wise formulation of sparse MoE layers, where local regions are routed to a small subset of convolutional experts, can be integrated into CNN-based semantic segmentation models. When inserted into encoder-decoder and backbone architectures and tested on Cityscapes and BDD100K, these layers produce architecture-dependent improvements of up to +3.9 mIoU with only minor computational overhead, while exposing clear patterns of expert specialization and dynamic routing.

What carries the argument

patch-wise sparse MoE layer in which local image regions are routed to a small subset of convolutional experts
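To make the mechanism concrete, the routing described above can be sketched in a few lines of NumPy. The grid split, n = 8 experts, and top-2 routing follow Figure 1; everything else here (mean-pooled gating features, per-pixel linear maps standing in for convolutional experts) is a simplifying assumption for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchwise_moe(feat, experts, gate_w, g=4, k=2):
    """Patch-wise sparse MoE over a feature map.

    feat:    (H, W, C) input feature map
    experts: list of n callables, each mapping (h, w, C) -> (h, w, C)
    gate_w:  (C, n) gating weights; routing uses mean-pooled patch features
    The map is split into a g x g grid; each patch is routed to its top-k
    experts, whose outputs are blended with softmax weights.
    """
    H, W, C = feat.shape
    out = np.zeros_like(feat)
    ph, pw = H // g, W // g
    for i in range(g):
        for j in range(g):
            patch = feat[i*ph:(i+1)*ph, j*pw:(j+1)*pw]
            logits = patch.mean(axis=(0, 1)) @ gate_w      # (n,) gating scores
            topk = np.argsort(logits)[-k:]                 # indices of the k kept experts
            w = np.exp(logits[topk] - logits[topk].max())
            w /= w.sum()                                   # softmax over the kept experts only
            out[i*ph:(i+1)*ph, j*pw:(j+1)*pw] = sum(
                wi * experts[e](patch) for wi, e in zip(w, topk))
    return out

# Toy setup: n = 8 experts, modeled as 1x1 "convolutions" (per-pixel linear maps).
C, n = 16, 8
expert_mats = [rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(n)]
experts = [lambda p, M=M: p @ M for M in expert_mats]
gate_w = rng.standard_normal((C, n))

x = rng.standard_normal((32, 32, C))
y = patchwise_moe(x, experts, gate_w, g=4, k=2)
print(y.shape)  # (32, 32, 16)
```

Only k of the n experts run per patch, which is why the added computation stays modest even as total capacity grows with n.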

If this is right

  • Gains remain consistent yet vary between encoder-decoder and backbone-only CNN designs.
  • Routing dynamics and expert specialization respond strongly to architectural choices.
  • The added capacity produces only small increases in computational cost.
  • Similar patterns appear on both Cityscapes and BDD100K datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same patch-wise routing approach could be tested on other dense prediction tasks such as instance segmentation or monocular depth estimation.
  • High design sensitivity suggests that automated architecture search might locate even better MoE configurations.
  • If expert specialization generalizes, the method could help adapt segmentation models to new domains with limited retraining.

Load-bearing premise

The reported mIoU gains are produced by the patch-wise MoE routing and resulting expert specialization rather than by other training details, hyperparameter choices, or dataset-specific effects.

What would settle it

Retraining the identical baseline CNN architectures with the same training protocol and hyperparameters but without any MoE layers and observing whether the mIoU difference disappears or reverses.

Figures

Figures reproduced from arXiv:2604.13761 by Haixi Fan, J. Marius Zöllner, Konstantin Ditschuneit, Svetlana Pavlitska.

Figure 1
Figure 1. Overview of the studied PatchConvMoE layer within a CNN for semantic segmentation. A standard convolutional layer ℓ is replaced with a sparse MoE layer comprising convolutional layers as experts and a patch-wise gating network. The input feature map is split into a g × g grid of patches. Each patch is routed to a small subset of experts using top-k routing (here: a total of n = 8 experts with top-2 routin…
Figure 2
Figure 2. Gate architectures. 3.2. Patch-Level Routing: Sparse MoE layers in CNNs can operate at different routing granularities. Image-level routing, commonly used in prior work for classification, assigns a single expert configuration to the entire feature map, limiting spatial adaptability. Pixel-level routing, in contrast, allows fine-grained specialization but is computationally prohibitive for high-resolutio…
Figure 4
Figure 4. Expert co-routing frequency during inference. The set …
read the original abstract

Sparse mixture-of-experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace feed-forward network blocks. In contrast, integrating sparse MoE layers into convolutional neural networks (CNNs) remains inconsistent, with most prior work focusing on fine-grained MoEs operating at the filter or channel levels. In this work, we investigate a coarser, patch-wise formulation of sparse MoE layers for semantic segmentation, where local regions are routed to a small subset of convolutional experts. Through experiments on the Cityscapes and BDD100K datasets using encoder-decoder and backbone-based CNNs, we conduct a design analysis to assess how architectural choices affect routing dynamics and expert specialization. Our results demonstrate consistent, architecture-dependent improvements (up to +3.9 mIoU) with little computational overhead, while revealing strong design sensitivity. Our work provides empirical insights into the design and internal dynamics of sparse MoE layers in CNN-based dense prediction. Our code is available at https://github.com/KASTEL-MobilityLab/moe-layers/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that integrating a coarser patch-wise sparse MoE formulation into CNNs for semantic segmentation yields architecture-dependent mIoU gains (up to +3.9) on Cityscapes and BDD100K with low computational overhead. It contrasts this with prior fine-grained MoE work, reports experiments across encoder-decoder and backbone CNNs, and provides a design analysis of routing dynamics and expert specialization, concluding that performance is highly sensitive to architectural choices.

Significance. If the causal attribution holds, the work supplies empirical guidance on adapting MoE to CNN-based dense prediction, an area less mature than transformer MoE. The design-sensitivity findings and public code could help practitioners implement efficient segmentation models and avoid common pitfalls in routing and expert configuration.

major comments (3)
  1. [§4] §4 (Experimental Setup and Results): The manuscript does not state whether MoE-augmented models and baselines were trained under identical hyperparameter regimes, optimizer choices, or total training iterations. This is load-bearing for the central claim, because any incidental training differences could produce the reported mIoU deltas independently of patch-wise routing or expert specialization.
  2. [Results tables] Results tables (e.g., those reporting per-architecture mIoU): No capacity-matched controls (dense model with equivalent parameter count or FLOPs) or single-expert ablations are described. Without these, the +3.9 mIoU improvement and the attribution to sparse routing cannot be isolated from simple increases in effective capacity.
  3. [§5] §5 (Design Analysis): The claim of “strong design sensitivity” is supported only by qualitative observations; quantitative metrics of expert utilization, load balance, or specialization (e.g., entropy of routing distributions or inter-expert feature diversity) are not reported, weakening the behavioral insights that accompany the performance numbers.
minor comments (2)
  1. [Abstract] The abstract states “consistent, architecture-dependent improvements” but does not list the per-architecture deltas; adding a one-sentence summary would improve clarity.
  2. Figure captions describing routing visualizations should explicitly state the color scale or metric used so readers can interpret expert activation patterns without referring to the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the experimental rigor and analysis.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup and Results): The manuscript does not state whether MoE-augmented models and baselines were trained under identical hyperparameter regimes, optimizer choices, or total training iterations. This is load-bearing for the central claim, because any incidental training differences could produce the reported mIoU deltas independently of patch-wise routing or expert specialization.

    Authors: We confirm that all models (MoE-augmented and baselines) were trained under identical conditions using the same optimizer (AdamW), learning rate schedule, batch size, total iterations, and data augmentations as described in the experimental protocol. This equivalence was followed throughout but not explicitly reiterated for the MoE variants. We have added a dedicated clarification paragraph in the revised §4 stating that identical hyperparameter regimes were used for all compared models. revision: yes

  2. Referee: [Results tables] Results tables (e.g., those reporting per-architecture mIoU): No capacity-matched controls (dense model with equivalent parameter count or FLOPs) or single-expert ablations are described. Without these, the +3.9 mIoU improvement and the attribution to sparse routing cannot be isolated from simple increases in effective capacity.

    Authors: We agree that capacity-matched controls and single-expert ablations are necessary to isolate the contribution of sparse routing. In the revised manuscript we have added new experiments: (i) dense baselines whose channel widths were scaled to match the total parameter count and FLOPs of the corresponding MoE models, and (ii) single-expert ablations in which routing is replaced by a fixed assignment to one expert. The updated results tables show that the reported mIoU gains remain after these controls, supporting attribution to the patch-wise sparse mechanism. revision: yes
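The capacity matching described in (i) can be sanity-checked with a back-of-envelope calculation. The sketch below uses hypothetical parameter counts, and the quadratic scaling assumes the model's parameters are dominated by conv weights that grow as C_in × C_out.

```python
import math

def width_multiplier(base_params, target_params):
    """Channel-width multiplier for a capacity-matched dense baseline.

    Convolutional layer parameters scale roughly with C_in * C_out, so
    multiplying every channel width by m multiplies the parameter count
    by about m**2. Solving m**2 * base_params = target_params gives the
    multiplier needed to match a target parameter budget.
    """
    return math.sqrt(target_params / base_params)

# Hypothetical numbers: a 2.0M-parameter dense baseline vs. an MoE
# variant whose experts raise the total to 4.5M parameters.
m = width_multiplier(2.0e6, 4.5e6)
print(round(m, 3))  # 1.5
```

Note that a width-matched dense model pays the full FLOP cost of its extra parameters, whereas a top-k MoE activates only a fraction of its parameters per patch, so parameter-matched and FLOP-matched baselines are distinct controls.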

  3. Referee: [§5] §5 (Design Analysis): The claim of “strong design sensitivity” is supported only by qualitative observations; quantitative metrics of expert utilization, load balance, or specialization (e.g., entropy of routing distributions or inter-expert feature diversity) are not reported, weakening the behavioral insights that accompany the performance numbers.

    Authors: We acknowledge that the original §5 relied on qualitative routing visualizations. To provide quantitative support we have added the following metrics to the revised design analysis: entropy of the per-patch routing distributions, coefficient of variation for expert load balance, and average cosine similarity between expert output features as a measure of specialization. These statistics are now reported for each architectural variant and corroborate the observed design sensitivity. revision: yes
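The three metrics named above are straightforward to compute from routing probabilities and expert outputs. This NumPy sketch is illustrative, not the authors' code; it assumes routing distributions of shape (patches, experts) and one pooled feature vector per expert.

```python
import numpy as np

def routing_entropy(probs):
    """Mean entropy (nats) of per-patch routing distributions, shape (P, n)."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=1).mean())

def load_balance_cv(assignments, n_experts):
    """Coefficient of variation of expert load; 0 means perfectly balanced."""
    counts = np.bincount(assignments, minlength=n_experts).astype(float)
    return float(counts.std() / counts.mean())

def mean_pairwise_cosine(expert_feats):
    """Average cosine similarity between expert output features, shape (n, D).

    Lower values indicate more diverse, i.e. more specialized, experts.
    """
    f = expert_feats / np.linalg.norm(expert_feats, axis=1, keepdims=True)
    sim = f @ f.T
    iu = np.triu_indices(len(f), k=1)  # upper triangle: distinct expert pairs
    return float(sim[iu].mean())

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(8), size=256)   # 256 patches, 8 experts
assign = probs.argmax(axis=1)                 # hard top-1 assignments
feats = rng.standard_normal((8, 64))          # pooled per-expert features
print(routing_entropy(probs), load_balance_cv(assign, 8),
      mean_pairwise_cosine(feats))
```

Uniform routing gives entropy log(n) and a load-balance CV of 0, so both metrics have natural reference points against which design sensitivity can be quantified.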

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or predictions

full rationale

The paper reports experimental results on inserting patch-wise sparse MoE layers into CNN encoders/decoders for semantic segmentation on Cityscapes and BDD100K. All claims rest on measured mIoU deltas, routing statistics, and design-sensitivity observations rather than any first-principles derivation, fitted-parameter prediction, or self-referential equation. No load-bearing step reduces to its own inputs by construction; the work contains no mathematical chain that could be circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine learning paper. The central claims rest on experimental results from public benchmark datasets rather than on mathematical derivations or new theoretical constructs. No free parameters, axioms, or invented entities are required to support the reported findings.

pith-pipeline@v0.9.0 · 5512 in / 1133 out tokens · 36013 ms · 2026-05-10T13:34:03.543935+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 6 canonical work pages · 1 internal anchor

  1. Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV), 2018.
  2. Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic convolution: Attention over convolution kernels. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  3. Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding the mixture-of-experts layer in deep learning. Advances in Neural Information Processing Systems, 35, 2022.
  4. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  5. Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the As...
  6. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  7. Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P. Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen S. Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, and...
  8. David Eigen, Marc'Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013.
  9. William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 2022.
  10. Xumeng Han, Longhui Wei, Zhiyang Dou, Zipeng Wang, Chenhui Qiang, Xin He, Yingfei Sun, Zhenjun Han, and Qi Tian. ViMoE: An empirical study of designing vision mixture-of-experts. arXiv preprint arXiv:2410.15732, 2024.
  11. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  12. Andrew Howard, Ruoming Pang, Hartwig Adam, Quoc V. Le, Mark Sandler, Bo Chen, Weijun Wang, Liang-Chieh Chen, Mingxing Tan, Grace Chu, Vijay Vasudevan, and Yukun Zhu. Searching for MobileNetV3. In IEEE International Conference on Computer Vision (ICCV), 2019.
  13. Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems, 2023.
  14. Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 1991.
  15. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations (ICLR), 2021.
  16. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  17. Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H. Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In International Conference on Knowledge Discovery & Data Mining, 2018.
  18. Oier Mees, Andreas Eitel, and Wolfram Burgard. Choosing smartly: Adaptive multimodal fusion for object detection in changing environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016.
  19. Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. CoRR, abs/1606.02147.
  20. Svetlana Pavlitska, Christian Hubschneider, Lukas Struppek, and J. Marius Zöllner. Sparsely-gated mixture-of-expert layers for CNN interpretability. In International Joint Conference on Neural Networks (IJCNN), 2023.
  21. Svetlana Pavlitska, Haixi Fan, Konstantin Ditschuneit, and Marius Zöllner. Robust experts: The effect of adversarial training on CNNs with sparse mixture-of-experts layers. In IEEE International Conference on Computer Vision (ICCV) Workshops, 2025.
  22. Svetlana Pavlitskaya, Christian Hubschneider, Michael Weber, Ruby Moritz, Fabian Huger, Peter Schlicht, and Marius Zollner. Using mixture of expert models to gain insights into semantic segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020.
  23. Svetlana Pavlitskaya, Christian Hubschneider, and Michael Weber. Evaluating mixture-of-experts architectures for network aggregation. In Deep Neural Networks and Data for Automated Driving: Robustness, Uncertainty Quantification, and Insights Towards Safety, 2022.
  24. Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. arXiv preprint arXiv:2308.00951, 2023.
  25. Jiaxiong Qiu, Cai Chen, Shuaicheng Liu, Heng-Yu Zhang, and Bing Zeng. SlimConv: Reducing channel redundancy in convolutional neural networks by features recombining. IEEE Transactions on Image Processing, 2021.
  26. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In International Conference on Machine Learning, 2022.
  27. Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
  28. Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. Hash layers for large sparse models. In Conference on Neural Information Processing Systems (NeurIPS), 2021.
  29. Eduardo Romera, José M. Álvarez, Luis Miguel Bergasa, and Roberto Arroyo. ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 2018.
  30. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2015.
  31. Leonardo Rossi, Vittorio Bernuzzi, Tomaso Fontanini, Massimo Bertozzi, and Andrea Prati. Swin2-MoSE: A new single image super-resolution model for remote sensing. IET Image Processing, 2025.
  32. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017.
  33. Abhinav Valada, Ankit Dhall, and Wolfram Burgard. Convoluted mixture of deep experts for robust semantic segmentation. In International Conference on Intelligent Robots and Systems (IROS) Workshops, 2016.
  34. Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  35. Xin Wang, Fisher Yu, Lisa Dunlap, Yi-An Ma, Ruth Wang, Azalia Mirhoseini, Trevor Darrell, and Joseph E. Gonzalez. Deep mixture of experts via shallow embedding. In Conference on Uncertainty in Artificial Intelligence, 2019.
  36. Xiang Xu, Lingdong Kong, Hui Shuai, Liang Pan, Ziwei Liu, and Qingshan Liu. LiMoE: Mixture of LiDAR representation learners from automotive scenes. CoRR, abs/2501.04004.
  37. Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, and Yang You. Go wider instead of deeper. In AAAI Conference on Artificial Intelligence (AAAI), 2022.
  38. Brandon Yang, Gabriel Bender, Quoc V. Le, and Jiquan Ngiam. CondConv: Conditionally parameterized convolutions for efficient inference. Conference on Neural Information Processing Systems (NeurIPS), 32, 2019.
  39. Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  40. Lianbo Zhang, Shaoli Huang, Wei Liu, and Dacheng Tao. Learning a mixture of granularity-specific experts for fine-grained categorization. In IEEE International Conference on Computer Vision (ICCV), 2019.
  41. Yihua Zhang, Ruisi Cai, Tianlong Chen, Guanhua Zhang, Huan Zhang, Pin-Yu Chen, Shiyu Chang, Zhangyang Wang, and Sijia Liu. Robust mixture-of-expert training for convolutional neural networks. In IEEE International Conference on Computer Vision (ICCV), 2023.
  42. Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  43. Barret Zoph. Designing effective sparse expert models. In IEEE International Parallel and Distributed Processing Symposium (IPDPS) Workshops, 2022.