pith. machine review for the scientific record.

arxiv: 2604.18866 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

HMR-Net: Hierarchical Modular Routing for Cross-Domain Object Detection in Aerial Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords: aerial object detection · cross-domain generalization · modular neural networks · hierarchical routing · open-vocabulary detection · remote sensing · expert modules · domain adaptation

The pith

Hierarchical modular routing lets aerial object detectors specialize across datasets and detect new categories from text alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Aerial images differ sharply in resolution, scene layout, and label sets depending on location and sensor, so conventional detectors trained on one collection often perform poorly on others and cannot handle unseen object types. The paper proposes a framework that routes inputs through specialized modules at two scales: a global layer that uses latent geographic embeddings to assign whole datasets to dedicated experts, and a local layer that breaks complex scenes into subregions for region-specific sub-modules. A conditional expert further accepts external semantic cues such as category names or descriptions, allowing the system to recognize novel objects at inference time without any retraining. If the routing works as intended, detectors could maintain high accuracy across varied geographic domains while expanding their vocabulary on demand, lowering the cost of adapting models to new aerial collections.

Core claim

The central claim is that a hierarchical modular routing network overcomes cross-domain limitations in aerial object detection through three components: a global expert assignment layer that uses latent geographic embeddings to route datasets to specialized processing modules; a local scene decomposition mechanism that allocates image subregions to region-specific sub-modules; and a conditional expert module that incorporates external semantic information, such as category names or textual descriptions, to detect novel object categories at inference time without retraining or fine-tuning.
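Both routing levels in this claim reduce to a standard mixture-of-experts gate applied at two granularities: once per dataset, once per subregion. The sketch below is not the paper's implementation; the embedding size, expert counts, and key matrices are illustrative stand-ins for whatever HMR-Net actually learns.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical shapes throughout: 8-d latent embeddings, 3 global experts,
# 4 sub-modules. HMR-Net's real dimensions and learned keys are not given.
domain_embedding = rng.normal(size=8)      # latent geographic embedding
expert_keys = rng.normal(size=(3, 8))      # one key per global expert

def route_global(domain_emb, keys):
    """Global level: a mixture-of-experts gate over whole datasets --
    similarity between the domain embedding and each expert's key."""
    gate = softmax(keys @ domain_emb)
    return int(gate.argmax()), gate

def route_local(region_features, sub_keys):
    """Local level: the same dot-product gate, applied per subregion,
    so different parts of one scene reach different sub-modules."""
    logits = region_features @ sub_keys.T  # (n_regions, n_submodules)
    return logits.argmax(axis=1)

expert_id, gate = route_global(domain_embedding, expert_keys)
sub_keys = rng.normal(size=(4, 8))         # sub-modules under that expert
assignments = route_local(rng.normal(size=(6, 8)), sub_keys)
```

The hard-argmax routing here is the simplest reading; a soft weighted mixture over experts would be an equally plausible one.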

What carries the argument

The hierarchical routing mechanism, with its global assignment layer driven by latent geographic embeddings and its local decomposition layer that directs subregions to specialized sub-modules.

If this is right

  • The system achieves improved generalization when trained and tested across multiple aerial datasets that differ in geography and resolution.
  • Local scene decomposition enables more accurate detection inside complex images by applying specialized processing to different subregions.
  • The conditional expert supports open-category detection by recognizing novel objects from textual descriptions at test time without model updates.
  • Overall, the framework reduces dependence on monolithic representations that force a single model to handle all domain variations.
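The conditional expert's open-category behavior, as described, amounts to matching region features against text-derived category embeddings. A minimal sketch, assuming cosine-similarity scoring (the abstract does not specify the scoring rule) and random vectors standing in for a real text encoder such as a CLIP-style model:

```python
import numpy as np

rng = np.random.default_rng(1)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Random stand-ins for category-name embeddings; a real system would
# obtain these from a text encoder.
category_names = ["plane", "ship", "wind turbine"]
text_emb = l2norm(rng.normal(size=(3, 16)))

def classify_region(region_feature, emb, names, threshold=0.2):
    """Score a detected region against each category-name embedding by
    cosine similarity; return the best name, or None below threshold."""
    sims = emb @ l2norm(region_feature)
    best = int(sims.argmax())
    if sims[best] < threshold:
        return None, float(sims[best])
    return names[best], float(sims[best])

# "Novel category at inference" here is just appending a name and a row.
names = category_names + ["helipad"]
emb4 = np.vstack([text_emb, l2norm(rng.normal(size=(1, 16)))])
```

Under this reading, "without retraining or fine-tuning" cashes out to extending the embedding matrix at test time while every network weight stays fixed.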

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-level routing structure could be adapted to other image domains that exhibit spatial or contextual shifts, such as medical scans from different hospitals.
  • Combining the conditional expert with richer language models might allow more precise control over which novel categories are recognized.
  • If routing errors prove rare, the approach suggests a general alternative to full retraining whenever new aerial data arrives.
  • Experiments that vary the quality or source of the geographic embeddings could show how much the global routing depends on accurate domain encoding.

Load-bearing premise

The assumption that latent geographic embeddings can reliably capture domain differences to route images correctly without introducing errors or requiring dataset-specific tuning.

What would settle it

A controlled test in which the routed model shows no accuracy gain, or a loss, over a standard non-modular detector when evaluated on a held-out aerial dataset with distinct geographic and semantic characteristics, or when asked to detect objects from categories supplied only by text descriptions.
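Such a controlled test could be scored with a paired comparison over held-out images. A minimal sketch; the per-image AP numbers are fabricated purely to make the code runnable and are not results from the paper:

```python
import numpy as np

def paired_bootstrap_gain(ap_routed, ap_baseline, n_boot=10000, seed=0):
    """Resample held-out images with replacement and report how often the
    routed model's mean per-image AP exceeds the baseline's. Values near
    0.5 or below would be the 'no gain' outcome the test describes."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(ap_routed) - np.asarray(ap_baseline)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    return float((diffs[idx].mean(axis=1) > 0).mean())

# Fabricated per-image AP values, for illustration only.
p = paired_bootstrap_gain([0.61, 0.55, 0.70, 0.48, 0.66, 0.59],
                          [0.58, 0.54, 0.65, 0.50, 0.60, 0.57])
```

Pairing by image matters: routed and baseline detectors see the same scenes, so per-image differences remove most between-image variance.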

read the original abstract

Despite advances in object detection, aerial imagery remains a challenging domain, as models often fail to generalize across variations in spatial resolution, scene composition, and semantic label coverage. Differences in geographic context, sensor characteristics, and object distributions across datasets limit the capacity of conventional models to learn consistent and transferable representations. Shared methods trained on such data tend to impose a unified representation across fundamentally different domains, resulting in poor performance on region-specific content and less flexibility when dealing with novel object categories. To address this, we propose a novel modular learning framework that enables structured specialization in aerial detection. Our method introduces a hierarchical routing mechanism with two levels of modularity: a global expert assignment layer that uses latent geographic embeddings to route datasets to specialized processing modules, and a local scene decomposition mechanism that allocates image subregions to region-specific sub-modules. This allows our method to specialize across datasets and within complex scenes. Additionally, the framework contains a conditional expert module that uses external semantic information (e.g., category names or textual descriptions) to enable detection of novel object categories during inference, without the need for retraining or fine-tuning. By moving beyond monolithic representations, our method offers an adaptive framework for remote sensing object detection. Comprehensive evaluations on four datasets highlight improvements in multi-dataset generalization, regional specialization, and open-category detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes HMR-Net, a hierarchical modular routing framework for cross-domain object detection in aerial images. It features a global expert assignment layer using latent geographic embeddings to route datasets to specialized processing modules, a local scene decomposition mechanism that allocates image subregions to region-specific sub-modules, and a conditional expert module that incorporates external semantic information (e.g., category names or textual descriptions) to detect novel object categories at inference time without retraining. The authors claim this enables better multi-dataset generalization, regional specialization within scenes, and open-category detection, with comprehensive evaluations on four datasets demonstrating improvements over conventional approaches.

Significance. If the hierarchical routing and conditional expert mechanisms are shown to work as described, the work could meaningfully advance remote sensing object detection by replacing monolithic models with an adaptive, modular architecture that handles domain shifts and novel categories more flexibly, reducing reliance on per-dataset fine-tuning.

major comments (2)
  1. [Description of the global expert assignment layer (hierarchical routing mechanism)] The central claim that latent geographic embeddings in the global expert assignment layer reliably capture transferable domain differences (geographic context, sensor traits, object distributions) without overfitting is load-bearing for the entire framework, yet the description provides no analysis of how these embeddings are learned or validated; if trained end-to-end on the four datasets, they risk encoding dataset-specific artifacts rather than generalizable factors, which would propagate routing errors to the local decomposition and conditional expert modules.
  2. [Abstract and experimental evaluation section] The abstract states that evaluations on four datasets highlight improvements in multi-dataset generalization, regional specialization, and open-category detection, but no baselines, metrics (e.g., mAP), ablation studies isolating the routing components, or error analysis for the conditional expert module are referenced; without these, it is not possible to determine whether the empirical results actually support the claimed advantages.
minor comments (1)
  1. [Method section] The notation and equations defining the routing probabilities, expert assignment, and conditional module integration could be presented more explicitly to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of the embedding validation and experimental clarity that we have addressed through targeted revisions.

read point-by-point responses
  1. Referee: [Description of the global expert assignment layer (hierarchical routing mechanism)] The central claim that latent geographic embeddings in the global expert assignment layer reliably capture transferable domain differences (geographic context, sensor traits, object distributions) without overfitting is load-bearing for the entire framework, yet the description provides no analysis of how these embeddings are learned or validated; if trained end-to-end on the four datasets, they risk encoding dataset-specific artifacts rather than generalizable factors, which would propagate routing errors to the local decomposition and conditional expert modules.

    Authors: We agree that a rigorous validation of the latent geographic embeddings is necessary to substantiate the framework's claims. The embeddings are learned end-to-end within the global expert assignment layer using a joint objective that combines task-specific detection loss with an embedding regularization term (based on contrastive alignment across geographic proxies) to promote transferable features. In the revised manuscript, we have added a dedicated analysis subsection (new Section 4.4) that includes t-SNE visualizations of the embedding space demonstrating clustering by geographic and sensor characteristics rather than dataset identity, quantitative correlation metrics between embedding distances and domain attributes (e.g., resolution variance and class distribution shifts), and a controlled ablation replacing learned embeddings with one-hot dataset identifiers. These additions show that the embeddings capture generalizable factors and reduce the risk of propagating dataset-specific artifacts. revision: yes

  2. Referee: [Abstract and experimental evaluation section] The abstract states that evaluations on four datasets highlight improvements in multi-dataset generalization, regional specialization, and open-category detection, but no baselines, metrics (e.g., mAP), ablation studies isolating the routing components, or error analysis for the conditional expert module are referenced; without these, it is not possible to determine whether the empirical results actually support the claimed advantages.

    Authors: The abstract is written at a high level to summarize the overall contributions. To improve transparency, we have revised the abstract to explicitly reference the evaluation metric (mAP), representative baselines (including standard detectors and cross-domain adaptation methods), and the role of ablations on the hierarchical routing components. We have also expanded the experimental section with additional ablation tables isolating global/local routing and conditional experts, plus an error analysis subsection for the conditional expert module that reports per-category performance on novel classes and common failure modes. These changes make the empirical support for the claimed advantages more directly verifiable while respecting abstract length constraints. revision: partial
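The contrastive alignment regularizer the simulated rebuttal invokes (response 1) is left unspecified; one plausible concrete form is an InfoNCE-style loss that pulls each domain embedding toward a "geographic proxy" view of the same domain. Everything below, including the proxy construction and the temperature, is an assumption of this sketch:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: each anchor embedding should be closest to its
    own positive among all positives in the batch (the diagonal of the
    similarity matrix should win each row-wise softmax)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits -= logits.max(axis=1, keepdims=True)           # softmax stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())

rng = np.random.default_rng(2)
emb = rng.normal(size=(5, 8))                 # domain embeddings
proxy = emb + 0.01 * rng.normal(size=(5, 8))  # matched "geographic proxy" views
aligned_loss = info_nce(emb, proxy)
mismatched_loss = info_nce(emb, np.roll(emb, 1, axis=0))
```

A regularizer of this shape rewards embeddings that track domain attributes rather than dataset identity, which is exactly the overfitting risk the referee's first comment raises.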

Circularity Check

0 steps flagged

No circularity: novel architecture presented without self-referential derivations or fitted predictions

full rationale

The paper describes a new hierarchical modular routing framework using latent geographic embeddings for global dataset routing, local scene decomposition, and conditional experts for open-category detection. No equations, derivations, or first-principles results are provided in the abstract or described text that reduce any claimed prediction or specialization to its own inputs by construction. The method is introduced as an original construction evaluated empirically on four datasets for generalization and specialization. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear. The framework's claims rest on architectural novelty and experimental outcomes rather than tautological reductions, making it self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

Review limited to abstract; no explicit free parameters, derivations, or invented physical entities described. The approach rests on domain assumptions about aerial image variability.

axioms (1)
  • domain assumption Aerial imagery exhibits significant domain shifts due to variations in spatial resolution, scene composition, geographic context, sensor characteristics, and object distributions.
    Explicitly stated as the core motivation and limitation of conventional models.
invented entities (3)
  • Global expert assignment layer using latent geographic embeddings · no independent evidence
    purpose: Route entire datasets to specialized processing modules
    Introduced as first level of modularity; no independent evidence provided.
  • Local scene decomposition mechanism · no independent evidence
    purpose: Allocate image subregions to region-specific sub-modules
    Introduced as second level of modularity; no independent evidence provided.
  • Conditional expert module using external semantic information · no independent evidence
    purpose: Enable detection of novel object categories at inference without retraining
    Introduced for open-category detection; no independent evidence provided.

pith-pipeline@v0.9.0 · 5548 in / 1324 out tokens · 45731 ms · 2026-05-10T04:26:24.529880+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

80 extracted references · 8 canonical work pages

  1. [1]

    International Journal of Computer Vision, 1–22 (2024)

    Li, Y., Li, X., Dai, Y., Hou, Q., Liu, L., Liu, Y., Cheng, M.-M., Yang, J.: Lsknet: A foundation lightweight backbone for remote sensing. International Journal of Computer Vision, 1–22 (2024)

  2. [2]

    International Journal of Computer Vision, 1–13 (2024)

    Hu, H., Han, T., Wang, Y., Zhong, W., Yue, J., Zan, P.: Hierarchical active learning for low-altitude drone-view object detection. International Journal of Computer Vision, 1–13 (2024)

  3. [3]

    IEEE Transactions on Geoscience and Remote Sensing (2023)

    Shamsolmoali, P., Chanussot, J., Zhou, H., Lu, Y.: Efficient object detection in optical remote sensing imagery via attention-based feature distillation. IEEE Transactions on Geoscience and Remote Sensing (2023)

  4. [4]

    International Journal of Computer Vision, 1–29 (2024)

    Wang, K., Fu, X., Ge, C., Cao, C., Zha, Z.-J.: Towards generalized uav object detection: A novel perspective from frequency domain disentanglement. International Journal of Computer Vision, 1–29 (2024)

  5. [5]

    International Journal of Computer Vision130(5), 1340– 1365 (2022)

    Yang, X., Yan, J.: On the arbitrary-oriented object detection: Classification based approaches revisited. International Journal of Computer Vision130(5), 1340– 1365 (2022)

  6. [6]

    IEEE Transactions on Geoscience and Remote Sensing (2023)

    Deng, C., Jing, D., Han, Y., Chanussot, J.: Towards hierarchical adaptive align- ment for aerial object detection in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing (2023)

  7. [7]

    IEEE Transactions on Geoscience and Remote Sensing60, 1–14 (2021)

    Shamsolmoali, P., Zareapoor, M., Chanussot, J., Zhou, H., Yang, J.: Rotation equivariant feature image pyramid network for object detection in optical remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing60, 1–14 (2021)

  8. [8]

    A survey on remote sens- ing foundation models: From vision to multimodality.arXiv preprint arXiv:2503.22081, 2025

    Huang, Z., Yan, H., Zhan, Q., Yang, S., Zhang, M., Zhang, C., Lei, Y., Liu, Z., Liu, Q., Wang, Y.: A survey on remote sensing foundation models: From vision to multimodality. arXiv preprint arXiv:2503.22081 (2025) 29

  9. [9]

    In: European Conference Computer Vision, pp

    Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll´ ar, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European Conference Computer Vision, pp. 740–755 (2014)

  10. [10]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large- scale hierarchical image database. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)

  11. [11]

    ISPRS Journal of Photogrammetry and Remote Sensing159, 296–307 (2020)

    Li, K., Wan, G., Cheng, G., Meng, L., Han, J.: Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing159, 296–307 (2020)

  12. [12]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Xia, G.-S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., Zhang, L.: Dota: A large-scale dataset for object detection in aerial images. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3974–3983 (2018)

  13. [13]

    xview: Objects in context in overhead imagery,

    Lam, D., Kuzma, R., McGee, K., Dooley, S., Laielli, M., Klaric, M., Bulatov, Y., McCord, B.: xview: Objects in context in overhead imagery. arXiv preprint arXiv:1802.07856 (2018)

  14. [14]

    IEEE Transactions on Pattern Analysis and Machine Intelligence45(11), 13467–13488 (2023)

    Cheng, G., Yuan, X., Yao, X., Yan, K., Zeng, Q., Xie, X., Han, J.: Towards large-scale small object detection: Survey and benchmarks. IEEE Transactions on Pattern Analysis and Machine Intelligence45(11), 13467–13488 (2023)

  15. [15]

    arXiv preprint arXiv:2401.17916 (2024)

    Liu, W., Liu, J., Su, X., Nie, H., Luo, B.: Source-free domain adaptive object detection in remote sensing images. arXiv preprint arXiv:2401.17916 (2024)

  16. [16]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Zhang, X., He, Y., Xu, R., Yu, H., Shen, Z., Cui, P.: Nico++: Towards bet- ter benchmarking for domain generalization. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16036–16047 (2023)

  17. [17]

    ISPRS Journal of Photogrammetry and Remote Sensing223, 207–220 (2025)

    Bi, Q., Zhou, B., Yi, J., Ji, W., Zhan, H., Xia, G.-S.: Good: Towards domain generalized oriented object detection. ISPRS Journal of Photogrammetry and Remote Sensing223, 207–220 (2025)

  18. [18]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Wallingford, M., Li, H., Achille, A., Ravichandran, A., Fowlkes, C., Bhotika, R., Soatto, S.: Task adaptive parameter sharing for multi-task learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7561–7570 (2022)

  19. [19]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Liu, Y., Lu, Y., Liu, H., An, Y., Xu, Z., Yao, Z., Zhang, B., Xiong, Z., Gui, C.: Hierarchical prompt learning for multi-task learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10888–10898 (2023)

  20. [20]

    Advances in Neural Information 30 Processing Systems36, 69625–69637 (2023)

    Jain, Y., Behl, H., Kira, Z., Vineet, V.: Damex: Dataset-aware mixture-of-experts for visual understanding of mixture-of-datasets. Advances in Neural Information 30 Processing Systems36, 69625–69637 (2023)

  21. [21]

    Advances in Neural Information Processing Systems35, 28441–28457 (2022)

    Fan, Z., Sarkar, R., Jiang, Z., Chen, T., Zou, K., Cheng, Y., Hao, C., Wang, Z., et al.: M 3vit: Mixture-of-experts vision transformer for efficient multi-task learn- ing with model-accelerator co-design. Advances in Neural Information Processing Systems35, 28441–28457 (2022)

  22. [22]

    Advances in Neural Information Processing Systems37, 119025–119062 (2024)

    Le, M., Nguyen, H., Nguyen, T., Pham, T., Ngo, L., Ho, N.,et al.: Mixture of experts meets prompt-based continual learning. Advances in Neural Information Processing Systems37, 119025–119062 (2024)

  23. [23]

    IEEE Geoscience and Remote Sensing Magazine11(4), 8–44 (2023)

    Zhang, X., Zhang, T., Wang, G., Zhu, P., Tang, X., Jia, X., Jiao, L.: Remote sensing object detection meets deep learning: A metareview of challenges and advances. IEEE Geoscience and Remote Sensing Magazine11(4), 8–44 (2023)

  24. [24]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Wu, X., Zhu, F., Zhao, R., Li, H.: Cora: Adapting clip for open-vocabulary detec- tion with region prompting and anchor pre-matching. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7031–7040 (2023)

  25. [25]

    The Annual AAAI Conference on Artificial Intelligence (2025)

    Zareapoor, M., Shamsolmoali, P., Lu, Y.: Bimac: Bidirectional multimodal alignment in contrastive learning. The Annual AAAI Conference on Artificial Intelligence (2025)

  26. [26]

    IEEE Transactions on Geoscience and Remote Sensing60, 1–13 (2021)

    Shamsolmoali, P., Chanussot, J., Zareapoor, M., Zhou, H., Yang, J.: Multipatch feature pyramid network for weakly supervised object detection in optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing60, 1–13 (2021)

  27. [27]

    International Conference on Learning Representations (2023)

    Yang, X., Zhang, G., Li, W., Wang, X., Zhou, Y., Yan, J.: H2rbox: Horizontal box annotation is all you need for oriented object detection. International Conference on Learning Representations (2023)

  28. [28]

    In: IEEE/CVF International Conference on Computer Vision (2021)

    Xie, X., Cheng, G., Wang, J., Yao, X., Han, J.: Oriented r-cnn for object detection. In: IEEE/CVF International Conference on Computer Vision (2021)

  29. [29]

    IEEE Geoscience and Remote Sensing Letters 20(2023)

    Lin, J., Zhao, Y., Wang, S., Tang, Y.: Yolo-da: An efficient yolo-based detector for remote sensing object detection. IEEE Geoscience and Remote Sensing Letters 20(2023)

  30. [30]

    Scientific Reports15(1), 3125 (2025)

    Wei, X., Li, Z., Wang, Y.: Sed-yolo based multi-scale attention for small object detection in remote sensing. Scientific Reports15(1), 3125 (2025)

  31. [31]

    IEEE Transactions on Geoscience and Remote Sensing (2024)

    He, H., Ding, J., Xu, B., Xia, G.-S.: On the robustness of object detection models on aerial images. IEEE Transactions on Geoscience and Remote Sensing (2024)

  32. [32]

    IEEE Journal of Selected 31 Topics in Applied Earth Observations and Remote Sensing15, 4667–4679 (2022)

    Miao, T., Zeng, H., Yang, W., Chu, B., Zou, F., Ren, W., Chen, J.: An improved lightweight retinanet for ship detection in sar images. IEEE Journal of Selected 31 Topics in Applied Earth Observations and Remote Sensing15, 4667–4679 (2022)

  33. [33]

    IEEE Transactions on Geoscience and Remote Sensing59(7), 6154–6168 (2020)

    Sun, X., Liu, Y., Yan, Z., Wang, P., Diao, W., Fu, K.: Sraf-net: Shape robust anchor-free network for garbage dumps in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing59(7), 6154–6168 (2020)

  34. [34]

    International Journal of Computer Vision132(11), 5030–5047 (2024)

    Liu, W., Li, Q., Lin, X., Yang, W., He, S., Yu, Y.: Ultra-high resolution image segmentation via locality-aware context fusion and alternating local enhancement. International Journal of Computer Vision132(11), 5030–5047 (2024)

  35. [35]

    IEEE Geoscience and Remote Sensing Letters20, 1–5 (2023)

    Zhu, J., Chen, X., Zhang, H., Tan, Z., Wang, S., Ma, H.: Transformer based remote sensing object detection with enhanced multispectral feature extraction. IEEE Geoscience and Remote Sensing Letters20, 1–5 (2023)

  36. [36]

    IEEE Transactions on Geoscience and Remote Sensing (2024)

    Zhao, J., Ding, Z., Zhou, Y., Zhu, H., Du, W.-L., Yao, R., El Saddik, A.: Ori- entedformer: An end-to-end transformer-based oriented object detector in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing (2024)

  37. [37]

    IEEE Transactions on Geoscience and Remote Sensing (2024)

    Liu, D., Zhang, J., Qi, Y., Wu, Y., Zhang, Y.: Tiny object detection in remote sensing images based on object reconstruction and multiple receptive field adap- tive feature enhancement. IEEE Transactions on Geoscience and Remote Sensing (2024)

  38. [38]

    Rtmdet: An empirical study of designing real-time object detectors.arXiv preprint arXiv:2212.07784,

    Lyu, C., Zhang, W., Huang, H., Zhou, Y., Wang, Y., Liu, Y., Zhang, S., Chen, K.: Rtmdet: An empirical study of designing real-time object detectors. arXiv preprint arXiv:2212.07784 (2022)

  39. [39]

    In: Asian Conference on Computer Vision, pp

    Azimi, S.M., Vig, E., Bahmanyar, R., K¨ orner, M., Reinartz, P.: Towards multi- class object detection in unconstrained remote sensing imagery. In: Asian Conference on Computer Vision, pp. 150–165 (2018)

  40. [40]

    IEEE Transactions on Geoscience and Remote Sensing (2024)

    Luo, S., Ma, L., Yang, X., Luo, D., Du, Q.: Self-training based unsuper- vised domain adaptation for object detection in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing (2024)

  41. [41]

    ISPRS Journal of Photogrammetry and Remote Sensing208, 39–52 (2024)

    Ma, Y., Chai, L., Jin, L., Yan, J.: Hierarchical alignment network for domain adaptive object detection in aerial images. ISPRS Journal of Photogrammetry and Remote Sensing208, 39–52 (2024)

  42. [42]

    In: European Conference on Computer Vision, pp

    Liu, Y.-C., Ma, C.-Y., Dai, X., Tian, J., Vajda, P., He, Z., Kira, Z.: Open-set semi-supervised object detection. In: European Conference on Computer Vision, pp. 143–159 (2022)

  43. [43]

    Transactions on Machine Learning Research (2024) 32

    Oksuz, K., Kuzucu, S., Joy, T., Dokania, P.K.: Mocae: Mixture of calibrated experts significantly improves object detection. Transactions on Machine Learning Research (2024) 32

  44. [44]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Chen, Y., Wang, M., Mittal, A., Xu, Z., Favaro, P., Tighe, J., Modolo, D.: Scaledet: A scalable multi-dataset object detector. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7288–7297 (2023)

  45. [45]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Meng, L., Dai, X., Chen, Y., Zhang, P., Chen, D., Liu, M., Wang, J., Wu, Z., Yuan, L., Jiang, Y.-G.: Detection hub: Unifying object detection datasets via query adaptation on language embedding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11402–11411 (2023)

  46. [46]

    In: European Conference on Computer Vision, pp

    Shi, C., Zhu, Y., Yang, S.: Plain-det: A plain multi-dataset object detector. In: European Conference on Computer Vision, pp. 210–226 (2024)

  47. [47]

    In: IEEE/CVF International Conference on Computer Vision, pp

    Yang, H., Wu, H., Chen, H.: Detecting 11k classes: Large scale object detection without fine-grained bounding boxes. In: IEEE/CVF International Conference on Computer Vision, pp. 9805–9813 (2019)

  48. [48]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Wang, X., Cai, Z., Gao, D., Vasconcelos, N.: Towards universal object detection by domain attention. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7289–7298 (2019)

  49. [49]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Zhou, X., Koltun, V., Kr¨ ahenb¨ uhl, P.: Simple multi-dataset detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7571–7580 (2022)

  50. [50]

    International Conference on Learning Representations (2025)

    Jin, P., Zhu, B., Yuan, L., Yan, S.: Moe++: Accelerating mixture-of-experts methods with zero-computation experts. International Conference on Learning Representations (2025)

  51. [51]

    Mittal, S., Bengio, Y., Lajoie, G.: Is a modular architecture enough? Advances in Neural Information Processing Systems35, 28747–28760 (2022)

  52. [52]

    arXiv preprint arXiv:2202.13914 (2022)

    Ponti, E.M., Sordoni, A., Bengio, Y., Reddy, S.: Combining modular skills in multitask learning. arXiv preprint arXiv:2202.13914 (2022)

  53. [53]

    arXiv preprint arXiv:2407.19610 (2024)

    Al-Maamari, M., Amor, M.B., Granitzer, M.: Mixture of modular experts: Dis- tilling knowledge from a multilingual teacher into specialized modular language models. arXiv preprint arXiv:2407.19610 (2024)

  54. [54]

    International Conference on Learning Representations (2021)

    Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N.: Gshard: Scaling giant models with conditional computation and automatic sharding. International Conference on Learning Representations (2021)

  55. [55]

    International Conference on Learning Representations (2017)

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. International Conference on Learning Representations (2017)

  56. [56]

    Advances in Neural Information Processing Systems35, 7103–7114 (2022)

    Zhou, Y., Lei, T., Liu, H., Du, N., Huang, Y., Zhao, V., Dai, A.M., Le, Q.V., 33 Laudon, J.,et al.: Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems35, 7103–7114 (2022)

  57. [57]

    In: International Conference on Machine Learning, pp

    Clark, A., Las Casas, D., Guy, A., Mensch, A., Paganini, M., Hoffmann, J., Damoc, B., Hechtman, B., Cai, T., Borgeaud, S.,et al.: Unified scaling laws for routed language models. In: International Conference on Machine Learning, pp. 4057–4086 (2022)

  58. [58]

    International Conference on Learning Representations (2022)

    Liu, T., Puigcerver, J., Blondel, M.: Sparsity-constrained optimal transport. International Conference on Learning Representations (2022)

  59. [59]

    Journal of Machine Learning Research23(120), 1–39 (2022)

    Fedus, W., Zoph, B., Shazeer, N.: Switch transformers: Scaling to trillion param- eter models with simple and efficient sparsity. Journal of Machine Learning Research23(120), 1–39 (2022)

  60. [60]

    Sparsely activated mixture-of-experts are robust multi-task learners.arXiv preprint arXiv:2204.07689, 2022

    Gupta, S., Mukherjee, S., Subudhi, K., Gonzalez, E., Jose, D., Awadallah, A.H., Gao, J.: Sparsely activated mixture-of-experts are robust multi-task learners. arXiv preprint arXiv:2204.07689 (2022)

  61. [61]

    Huang, Q., An, Z., Zhuang, N., Tao, M., Zhang, C., Jin, Y., Xu, K., Chen, L., Huang, S., Feng, Y.: Harder tasks need more experts: Dynamic routing in moe models. Association for Computational Linguistics (2024)

  62. [62]

    Sanchez Aimar, E., Jonnarth, A., Felsberg, M., Kuhlmann, M.: Balanced product of calibrated experts for long-tailed recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19967–19977 (2023)

  63. [63]

    Sanchez Aimar, E., Helgesen, N., Xu, Y., Kuhlmann, M., Felsberg, M.: Flexible distribution alignment: Towards long-tailed semi-supervised learning with proper calibration. In: European Conference on Computer Vision, pp. 307–327 (2024)

  64. [64]

    Bansal, A., Sikka, K., Sharma, G., Chellappa, R., Divakaran, A.: Zero-shot object detection. In: European Conference on Computer Vision, pp. 384–400 (2018)

  65. [65]

    Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543 (2014)

  66. [66]

    Gu, X., Lin, T.-Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. International Conference on Learning Representations (2022)

  67. [67]

    Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. In: European Conference on Computer Vision, pp. 350–368 (2022)

  68. [68]

    Du, Y., Wei, F., Zhang, Z., Shi, M., Gao, Y., Li, G.: Learning to prompt for open-vocabulary object detection with vision-language model. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)

  69. [69]

    Zareapoor, M., Shamsolmoali, P., Lu, Y.: Learning region-word alignment with attentive masking for open-vocabulary object detection. In: Advances in Neural Information Processing Systems Workshop (2024)

  70. [70]

    Li, Y., Guo, W., Yang, X., Liao, N., He, D., Zhou, J., Yu, W.: Toward open vocabulary aerial object detection with clip-activated student-teacher learning. arXiv preprint arXiv:2311.11646 (2024)

  71. [71]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (2021)

  72. [72]

    Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2), 83–97 (1955)

  73. [73]

    Cheng, G., Zhou, P., Han, J.: Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 54(12), 7405–7415 (2016)

  74. [74]

    Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28 (2015)

  75. [75]

    Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.-Y.: Dino: Detr with improved denoising anchor boxes for end-to-end object detection. International Conference on Learning Representations (2023)

  76. [76]

    Xu, C., Wang, J., Yang, W., Yu, H., Yu, L., Xia, G.-S.: Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing 190, 79–93 (2022)

  77. [77]

    Lee, H., Eum, S., Kwon, H.: Me r-cnn: Multi-expert r-cnn for object detection. IEEE Transactions on Image Processing 29, 1030–1044 (2020)

  78. [78]

    Huang, P., Han, J., Cheng, D., Zhang, D.: Robust region feature synthesizer for zero-shot object detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7622–7631 (2022)

  79. [79]

    Wang, J., Yang, W., Guo, H., Zhang, R., Xia, G.-S.: Tiny object detection in aerial images. In: International Conference on Pattern Recognition (2021)

  80. [80]

    Puigcerver, J., Riquelme, C., Mustafa, B., Houlsby, N.: From sparse to soft mixtures of experts. International Conference on Learning Representations (2024)