arxiv: 2605.21007 · v1 · pith:B732BSCXnew · submitted 2026-05-20 · 💻 cs.CV · cs.RO

LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation

Pith reviewed 2026-05-21 05:53 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords road segmentation · vision-LiDAR fusion · lightweight network · multi-modal perception · autonomous driving · real-time inference · KITTI Road dataset

The pith

LiteViLNet fuses vision and LiDAR in a lightweight network to reach 96.36% MaxF score with only 14.04M parameters for road segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LiteViLNet to meet the dual needs of high accuracy and real-time speed in road segmentation for autonomous driving on devices with limited compute resources. It processes RGB images and LiDAR point clouds through a dual-stream lightweight encoder that relies on depth-wise separable convolutions to extract features while keeping the total parameter count low. Cross-modal information is combined at multiple scales using the Multi-Scale Feature Fusion Module, and long-range dependencies are modeled efficiently by the large-kernel-bridge module. Experiments on the KITTI Road dataset show the resulting model outperforms other CNN-based approaches and matches larger transformer models in accuracy while delivering much higher inference speeds suitable for embedded hardware.

Core claim

LiteViLNet is a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for road segmentation. It uses a dual-stream lightweight encoder with depth-wise separable convolutions, a Multi-Scale Feature Fusion Module (MSFM) to enable cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. This combination attains a 96.36% MaxF score with only 14.04M parameters on the KITTI Road dataset, ranking best among CNN-based methods and comparable to larger transformer-based models, while running in model-only inference at 163.79 FPS on RTX 4060 Ti and 22.18 FPS on Jetson Orin NX.
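
To make the parameter saving from depth-wise separable convolutions concrete, the sketch below compares weight counts for a standard convolution and its depth-wise separable counterpart. The layer sizes (3×3 kernel, 64 input channels, 128 output channels) are illustrative assumptions, not values taken from the paper, and biases are ignored.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise k x k conv followed by a 1 x 1 point-wise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 2))  # 73728 8768 8.41
```

For this hypothetical layer the separable form needs roughly one eighth of the weights, which is the kind of saving a low total parameter count depends on.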

What carries the argument

The Multi-Scale Feature Fusion Module for cross-modal interaction at multiple scales together with the large-kernel-bridge module for efficient long-range dependency capture.
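
The Figure 3 caption lists adaptive gated feature fusion as the last step of the MSFM. A minimal sketch of what a gate of that kind could look like follows. The scalar gate weights `w_rgb`, `w_lidar`, and `bias` are hypothetical stand-ins for learned parameters, and the per-element sigmoid gate is an assumption about the general technique, not the paper's exact formulation.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb: Sequence[float], lidar: Sequence[float],
                 w_rgb: float = 1.0, w_lidar: float = 1.0,
                 bias: float = 0.0) -> List[float]:
    """Blend two feature vectors element-wise with a sigmoid gate.

    gate = sigmoid(w_rgb * r + w_lidar * l + bias)
    fused = gate * r + (1 - gate) * l
    """
    if len(rgb) != len(lidar):
        raise ValueError("feature vectors must have the same length")
    fused = []
    for r, l in zip(rgb, lidar):
        gate = sigmoid(w_rgb * r + w_lidar * l + bias)
        fused.append(gate * r + (1.0 - gate) * l)
    return fused


# Toy features: each fused value lies between its RGB and LiDAR inputs.
print(gated_fusion([0.2, 1.5, -0.3], [0.8, -0.5, 0.1]))
```

Because the gate is a convex weight, the output can lean on whichever modality the gate favours at each position, which is the intuition behind fusing texture and geometry adaptively.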

If this is right

  • The model supports real-time road segmentation on resource-constrained embedded platforms such as the Jetson Orin NX for autonomous driving.
  • CNN-based designs can compete with transformer-based models in accuracy for this task without high computational costs.
  • The approach validates practical deployment of lightweight multi-modal networks in intelligent robotic systems and real-world applications.
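
As a quick check on the real-time claim, the reported frame rates can be converted into per-frame latency. The sketch below does only that arithmetic. The FPS figures are the ones quoted from the abstract, which describes them as model-only inference, so any pre-processing or post-processing time would come on top.

```python
def latency_ms(fps: float) -> float:
    """Convert a frame rate into the time spent per frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Frame rates as reported in the abstract (model-only inference).
reported_fps = {"RTX 4060 Ti": 163.79, "Jetson Orin NX": 22.18}
for device, fps in reported_fps.items():
    print(f"{device}: {latency_ms(fps):.2f} ms per frame")
```

This works out to roughly 6.1 ms per frame on the RTX 4060 Ti and roughly 45.1 ms per frame on the Jetson Orin NX.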

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar lightweight fusion modules could be tested on other multi-modal perception tasks such as object detection or semantic segmentation in varied environments.
  • The linear complexity of the large-kernel module may allow the network to scale to higher-resolution inputs or video streams with limited additional cost.
  • Evaluating the same architecture on datasets that include adverse weather or different sensor calibrations would clarify robustness beyond the KITTI Road benchmark.
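
The scaling argument in the second point can be illustrated with a rough cost model. The sketch below assumes a depth-wise large-kernel convolution as a stand-in for the large-kernel-bridge module, whose internals are not given here, and compares it with global self-attention. The feature-map sizes, channel count, and kernel size are hypothetical.

```python
def large_kernel_depthwise_macs(height: int, width: int,
                                channels: int, kernel: int) -> int:
    """Multiply-accumulates for a depth-wise k x k conv over an H x W x C map."""
    return height * width * channels * kernel * kernel


def global_self_attention_macs(height: int, width: int, channels: int) -> int:
    """Multiply-accumulates for the QK^T and attention-times-V products only."""
    tokens = height * width
    return 2 * tokens * tokens * channels


# Hypothetical bottleneck sizes; each step doubles the resolution per side.
for scale in (1, 2, 4):
    h, w = 12 * scale, 40 * scale
    conv = large_kernel_depthwise_macs(h, w, channels=64, kernel=7)
    attn = global_self_attention_macs(h, w, channels=64)
    print(f"{h}x{w}: conv={conv:,} attn={attn:,} ratio={attn / conv:.1f}")
```

Doubling the resolution per side multiplies the convolution cost by 4 (linear in the number of positions) and the attention cost by 16 (quadratic), which is why a linear-complexity module would be expected to scale more gracefully.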

Load-bearing premise

The accuracy-efficiency balance on the KITTI Road dataset results from the specific designs of the Multi-Scale Feature Fusion Module and large-kernel-bridge module rather than from training details or dataset properties.

What would settle it

An ablation that removes the Multi-Scale Feature Fusion Module and the large-kernel-bridge module while keeping training and data the same would settle it: a substantial drop in MaxF score below 96% would show that those modules drive the reported tradeoff.
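
Since the test hinges on MaxF, it helps to recall how that metric is defined: the maximum F1 score over confidence thresholds. The sketch below is a simplified per-pixel version with made-up scores and labels. The official KITTI road benchmark evaluates in a bird's-eye-view projection, which is omitted here, and the threshold grid of 101 points is an assumption.

```python
from typing import Sequence


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int],
                    thr: float) -> float:
    """F1 score when pixels with score >= thr are predicted as road."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def max_f(scores: Sequence[float], labels: Sequence[int],
          num_thresholds: int = 101) -> float:
    """Maximum F1 over an evenly spaced grid of thresholds in [0, 1]."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    return max(f1_at_threshold(scores, labels, t) for t in thresholds)


# Made-up road confidences and ground-truth labels for five pixels.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
print(round(max_f(scores, labels), 4))  # 0.8571
```

An ablation comparison would compute this quantity for each model variant on the same evaluation split.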

Figures

Figures reproduced from arXiv: 2605.21007 by Bingtao Wang, Daojie Peng, Fulong Ma, Jun Ma, Liang Zhang.

Figure 1: Overall Architecture of LiteViLNet. The network consists of a dual-stream lightweight encoder, a multi-scale feature fusion module, a large-kernel-bridge module, and a decoder with deep supervision. The RGB stream uses a pre-trained MobileNetV3-Large [23] backbone, and the LiDAR stream uses a tiny encoder based on depth-wise separable convolutions. This allows us to extract multi-scale features from both m…
Figure 2: Illustration of the ADI Generation Pipeline. This process converts the raw 3D LiDAR point cloud into a 2D geometric feature map, which encodes the local height difference between the ground plane and obstacles to provide strong geometric cues for road segmentation.
Figure 3: Overall Architecture of the Proposed MSFM. It sequentially conducts channel dimension compression, intra-modal feature enhancement via ECA and coordinate attention, bidirectional cross-modal attention interaction, and adaptive gated feature fusion to effectively integrate complementary RGB texture and LiDAR geometric information at individual feature scales.
Figure 4: Qualitative Segmentation Results on the KITTI Road Validation Set. Each row shows (a) the input RGB image, (b) the corresponding Altitude Difference Image (ADI) derived from LiDAR depth data, (c) the segmentation prediction of LiteViLNet, and (d) the error map visualizing true positives (TP, green), false positives (FP, red), and false negatives (FN, blue). Quantitative metrics including F1-score and IoU a…
Figure 5: Real-world Deployment on Different Robots. LEFT: Kuafu Delivery Vehicle, MIDDLE: Unitree-B2, RIGHT: Unitree-G1. Left column of each case shows the first-person perception pipeline of LiteViLNet: RGB image, depth map, drivable area segmentation mask, and walkable confidence heatmap. Right column shows the robot navigating autonomously using our lightweight perception system.
Figure 6: First-person Perception Pipeline of LiteViLNet on the Kuafu Delivery Vehicle. The panels show: (a) raw RGB image from the Orbbec Gemini 336L camera, (b) corresponding depth map, (c) drivable area segmentation mask, and (d) walkable confidence heatmap overlaid with the planned robot trajectory. The bottom legend indicates the robot running path, demonstrating that LiteViLNet provides stable and accurate ro…
Original abstract

Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their enormous computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose LiteViLNet, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder and depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36% MaxF score, ranking the best among all CNN-based methods and being comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on RTX 4060 Ti (22.18 FPS on Jetson Orin NX). It outperforms numerous heavy-weight methods in inference speed while maintaining highly competitive accuracy, fully validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LiteViLNet, a lightweight dual-stream CNN for RGB-LiDAR fusion in road segmentation. It employs depth-wise separable convolutions in the encoders, a Multi-Scale Feature Fusion Module (MSFM) for cross-modal interaction at multiple levels, and a large-kernel-bridge module for long-range dependencies with linear complexity. On the KITTI Road benchmark, the model with 14.04M parameters is reported to achieve 96.36% MaxF (best among CNN-based methods, comparable to larger transformers) while running at 163.79 FPS on RTX 4060 Ti and 22.18 FPS on Jetson Orin NX.

Significance. If the performance gains can be shown to stem from the proposed MSFM and large-kernel-bridge rather than training-protocol differences, the work would offer a practically significant advance for real-time multi-modal perception on edge devices, demonstrating that carefully designed lightweight CNNs can close much of the accuracy gap with heavier transformer models without prohibitive compute.

major comments (2)
  1. [Experiments] Experiments section: no ablation studies are presented that isolate the contribution of the MSFM or large-kernel-bridge module (e.g., by removing each and re-training under identical conditions). Without these, it is impossible to verify that the 96.36% MaxF and efficiency balance arise from the architectural innovations rather than optimizer, augmentation, or schedule choices.
  2. [§4] Comparison table (presumably Table 1 or equivalent in §4): MaxF and FPS numbers for prior CNN and transformer methods are taken directly from the original publications without re-implementation under a matched protocol (identical epochs, learning-rate schedule, input resolution, and test split). This leaves open the possibility that reported gaps are explained by experimental-setup differences rather than the dual-stream encoder + MSFM design.
minor comments (2)
  1. [Abstract] The abstract states results on 'real-world applications' but the main text should explicitly indicate whether these are only qualitative visualizations or include quantitative metrics on additional datasets.
  2. [Method] Notation for the large-kernel-bridge module should be clarified (e.g., explicit definition of kernel size, dilation, and how linear complexity is obtained) to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the manuscript. We address the major comments point by point below and outline the revisions we will make.

  1. Referee: [Experiments] Experiments section: no ablation studies are presented that isolate the contribution of the MSFM or large-kernel-bridge module (e.g., by removing each and re-training under identical conditions). Without these, it is impossible to verify that the 96.36% MaxF and efficiency balance arise from the architectural innovations rather than optimizer, augmentation, or schedule choices.

    Authors: We fully agree with this observation. The current manuscript lacks explicit ablation studies to isolate the effects of the MSFM and large-kernel-bridge modules. To address this, we will conduct and include new ablation experiments in the revised version. Specifically, we will train variants without MSFM and without the large-kernel-bridge under the exact same training protocol, hyperparameters, and data augmentations as the full model. These results will be added to the Experiments section to demonstrate the contribution of each component.
    revision: yes

  2. Referee: [§4] Comparison table (presumably Table 1 or equivalent in §4): MaxF and FPS numbers for prior CNN and transformer methods are taken directly from the original publications without re-implementation under a matched protocol (identical epochs, learning-rate schedule, input resolution, and test split). This leaves open the possibility that reported gaps are explained by experimental-setup differences rather than the dual-stream encoder + MSFM design.

    Authors: This is a valid point regarding the comparability of results. While we reported the numbers from the original papers, as is common in the literature to avoid the prohibitive cost of re-implementing every method, we recognize that differences in training setups could influence the outcomes. In the revised manuscript, we will include a dedicated paragraph in the discussion or experiments section acknowledging these potential discrepancies and noting that all methods are evaluated on the same KITTI Road test set with standard metrics. Additionally, we will attempt to re-implement and re-train one or two representative methods under our protocol if resources permit, or at minimum provide more details on the training configurations used in the original works for better context.
    revision: partial
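
The matched-protocol ablation promised in the first response can be sketched as a small set of run configurations that differ only in which modules are enabled. Everything in the block is hypothetical: the `RunConfig` fields, the default epochs and learning rate, and the run names are placeholders, not the authors' settings.

```python
from dataclasses import asdict, dataclass, replace
from typing import List


@dataclass(frozen=True)
class RunConfig:
    name: str
    use_msfm: bool = True
    use_bridge: bool = True
    epochs: int = 100            # placeholder value
    learning_rate: float = 1e-3  # placeholder value
    seed: int = 0


# Fields that are allowed to differ between ablation variants.
MODULE_FIELDS = ("name", "use_msfm", "use_bridge")


def ablation_runs(base: RunConfig) -> List[RunConfig]:
    """Full model plus variants with each module (and both) removed."""
    return [
        base,
        replace(base, name="no_msfm", use_msfm=False),
        replace(base, name="no_bridge", use_bridge=False),
        replace(base, name="no_msfm_no_bridge", use_msfm=False, use_bridge=False),
    ]


def shares_protocol(runs: List[RunConfig]) -> bool:
    """True if all runs agree on every field outside MODULE_FIELDS."""
    def protocol(cfg: RunConfig) -> dict:
        return {k: v for k, v in asdict(cfg).items() if k not in MODULE_FIELDS}
    return all(protocol(run) == protocol(runs[0]) for run in runs)


runs = ablation_runs(RunConfig(name="full"))
for run in runs:
    print(run)
print("matched protocol:", shares_protocol(runs))
```

Deriving every variant from one base configuration makes it harder for an optimizer, schedule, or seed difference to slip in unnoticed, which is the concern the referee raises.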

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external KITTI evaluation

The paper introduces a dual-stream lightweight encoder, Multi-Scale Feature Fusion Module (MSFM), and large-kernel-bridge module as explicit architectural proposals, then measures their effect via standard MaxF and FPS on the public KITTI Road dataset. These performance numbers (96.36% MaxF, 14.04M parameters, 163.79 FPS) are direct experimental outputs under fixed protocols, not quantities derived by construction from the modules themselves or from any fitted parameter that is later relabeled as a prediction. Cited baseline numbers from prior CNN and transformer papers are externally reproducible on the same public benchmark and therefore constitute independent evidence rather than a self-citation chain that collapses the central claim. No self-definitional equations, uniqueness theorems imported from the authors' prior work, or ansatz smuggling appear in the derivation; the accuracy-efficiency balance is therefore an empirical finding, not a tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The central performance claim rests on the effectiveness of the custom fusion and bridging modules plus the assumption that KITTI results generalize; the paper introduces two new modules whose value is demonstrated only internally.

free parameters (1)
  • Channel counts, kernel sizes, and fusion scales in MSFM and encoders
    Hand-chosen architectural hyperparameters that directly affect the reported parameter count and speed-accuracy numbers.
axioms (2)
  • domain assumption: The KITTI Road dataset is representative of real-world road segmentation conditions for autonomous driving.
    All reported MaxF and FPS figures depend on this dataset reflecting deployment scenarios.
  • domain assumption: Standard CNN optimization converges to a solution whose metrics reflect the architectural contributions rather than training artifacts.
    No explicit verification of training stability or multiple runs is visible in the abstract.
invented entities (2)
  • Multi-Scale Feature Fusion Module (MSFM) · no independent evidence
    purpose: Enable cross-modal interaction at multiple feature levels between vision and LiDAR streams.
    New module introduced by the paper; no independent evidence outside the reported experiments.
  • large-kernel-bridge module · no independent evidence
    purpose: Capture long-range dependencies with linear computational complexity.
    New module introduced by the paper; no independent evidence outside the reported experiments.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 1 internal anchor

  1. S. Mozaffari, O. Al-Jarrah, M. Dianati, P. Jennings, and A. Mouzakitis, “Deep learning-based vehicle behavior prediction for autonomous driving applications: A review,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 1, pp. 33–47, 2022

  2. T. Sun et al., “Rod: Rgb-only fast and efficient off-road freespace detection,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2025, pp. 9787–9793

  3. A. Milioto, I. Vizzo, J. Behley, and C. Stachniss, “Rangenet++: Fast and accurate lidar semantic segmentation,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019, pp. 4213–4220

  4. R. Fan, H. Wang, P. Cai, and M. Liu, “Sne-roadseg: Incorporating surface normal information into semantic segmentation for accurate freespace detection,” in European Conference on Computer Vision, Springer, 2020, pp. 340–356

  5. Z. Chen, J. Zhang, and D. Tao, “Progressive LiDAR adaptation for road detection,” IEEE/CAA Journal of Automatica Sinica, vol. 6, no. 3, pp. 693–702, 2019

  6. C. Min et al., “Orfd: A dataset and benchmark for off-road freespace detection,” in 2022 International Conference on Robotics and Automation (ICRA), IEEE, 2022, pp. 2532–2538

  7. G. Zhao, F. Ma, W. Qi, Y. Liu, M. Liu, and J. Ma, “Curbnet: Curb detection framework based on lidar point cloud segmentation,” IEEE Transactions on Intelligent Transportation Systems, 2025

  8. F. Ma, Y. Liu, S. Wang, J. Wu, W. Qi, and M. Liu, “Self-supervised drivable area segmentation using lidar’s depth information for autonomous driving,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2023, pp. 41–48

  9. F. Ma, D. Peng, and J. Ma, “Annotation-free detection of drivable areas and curbs leveraging lidar point cloud maps,” arXiv preprint arXiv:2603.27553, 2026

  10. J. Xu, Z. Xiong, and S. P. Bhattacharyya, “Pidnet: A real-time semantic segmentation network inspired by pid controllers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19529–19539

  11. D. Peng, J. Cao, Q. Zhang, and J. Ma, “Lovon: Legged open-vocabulary object navigator,” arXiv preprint arXiv:2507.06747, 2025

  12. J. Zhang, H. Liu, K. Yang, X. Hu, R. Liu, and R. Stiefelhagen, “Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 12, pp. 14679–14694, 2023

  13. Y. Feng et al., “Sne-roadsegv2: Advancing heterogeneous feature fusion and fallibility awareness for freespace detection,” IEEE Transactions on Instrumentation and Measurement, vol. 74, pp. 1–9, 2025

  14. J. Li, Y. Zhang, P. Yun, G. Zhou, Q. Chen, and R. Fan, “Roadformer: Duplex transformer for rgb-normal semantic road scene parsing,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 7, pp. 5163–5172, 2024

  15. F. Ma, P. Hou, Y. Liu, Y. Liu, M. Liu, and J. Ma, “Annotation-free curb detection leveraging altitude difference image,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2025, pp. 762–768

  16. Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9992–10002

  17. Q.-H. Che, D.-T. Le, M.-Q. Pham, V.-T. Nguyen, and D.-K. Lam, “Twinlitenet+: An enhanced multi-task segmentation model for autonomous driving,” Computers and Electrical Engineering, vol. 128, p. 110694, 2025

  18. M. Li, J. Wang, and H. Chen, “Knowledge generation and distillation for road segmentation in intelligent transportation systems,” IEEE Transactions on Intelligent Transportation Systems, 2025

  19. A. A. Khan, J. Shao, Y. Rao, L. She, and H. T. Shen, “Lrdnet: Lightweight lidar aided cascaded feature pools for free road space detection,” IEEE Transactions on Multimedia, vol. 27, pp. 652–664, 2025

  20. Y. Chang, F. Xue, F. Sheng, W. Liang, and A. Ming, “Fast road segmentation via uncertainty-aware symmetric network,” in 2022 International Conference on Robotics and Automation (ICRA), IEEE, 2022, pp. 11124–11130

  21. Y. Cao and H. Qu, “Sdfnet for real-time semantic segmentation on urban road images,” IAENG International Journal of Computer Science, vol. 52, no. 12, pp. 4815–4821, 2025

  22. Y. Duan et al., “Lcire-net: Lightweight cross-modal information interaction for road feature extraction from remote sensing images and gps trajectory/lidar,” IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–18, 2025

  23. A. Howard et al., “Searching for mobilenetv3,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1314–1324

  24. J. M. Alvarez and A. M. Lopez, “Road detection based on illuminant invariance,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 1, pp. 184–193, 2011

  25. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440

  26. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241

  27. F. Wulff, B. Schaufele, O. Sawade, D. Becker, B. Henke, and I. Radusch, “Early fusion of camera and lidar for robust road detection based on u-net fcn,” in 2018 IEEE Intelligent Vehicles Symposium (IV), IEEE, 2018, pp. 1426–1431

  28. B. Zhou and P. Krahenbuhl, “Cross-view transformers for real-time map-view semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13750–13759

  29. A. G. Howard et al., “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017

  30. N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 122–138

  31. S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, “Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 561–580

  32. C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 334–349

  33. Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Eca-net: Efficient channel attention for deep convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11531–11539

  34. Q. Hou, D. Zhou, and J. Feng, “Coordinate attention for efficient mobile network design,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13708–13717

  35. W. Wang et al., “Internimage: Exploring large-scale vision foundation models with deformable convolutions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14408–14419

  36. M. Berman, A. R. Triki, and M. B. Blaschko, “The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4413–4421

  37. T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2999–3007

  38. A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013