LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation

Bingtao Wang; Daojie Peng; Fulong Ma; Jun Ma; Liang Zhang

arxiv: 2605.21007 · v1 · pith:B732BSCXnew · submitted 2026-05-20 · 💻 cs.CV · cs.RO

Pith reviewed 2026-05-21 05:53 UTC · model grok-4.3

keywords road segmentation · vision-LiDAR fusion · lightweight network · multi-modal perception · autonomous driving · real-time inference · KITTI Road dataset

The pith

LiteViLNet fuses vision and LiDAR in a lightweight network to reach 96.36% MaxF score with only 14.04M parameters for road segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LiteViLNet to meet the dual needs of high accuracy and real-time speed in road segmentation for autonomous driving on devices with limited compute resources. It processes RGB images and LiDAR point clouds through a dual-stream lightweight encoder that relies on depth-wise separable convolutions to extract features while keeping the total parameter count low. Cross-modal information is combined at multiple scales using the Multi-Scale Feature Fusion Module, and long-range dependencies are modeled efficiently by the large-kernel-bridge module. Experiments on the KITTI Road dataset show the resulting model outperforms other CNN-based approaches and matches larger transformer models in accuracy while delivering much higher inference speeds suitable for embedded hardware.
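To make the parameter saving from depth-wise separable convolutions concrete, the sketch below compares weight counts for a standard convolution and its depth-wise separable counterpart. The layer sizes (64 input channels, 128 output channels, 3×3 kernel) are illustrative assumptions, not values taken from the paper, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """One k x k filter per input channel, then a 1 x 1 point-wise channel mix."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative layer sizes (assumed, not from the paper).
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(std / sep, 2))  # 73728 8768 8.41
```

For this assumed layer the separable form needs roughly one eighth of the weights, which is the kind of saving that lets a dual-stream encoder stay small.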

Core claim

LiteViLNet is a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for road segmentation. It uses a dual-stream lightweight encoder with depth-wise separable convolutions, a Multi-Scale Feature Fusion Module to enable cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. This combination attains a 96.36% MaxF score with only 14.04M parameters, ranking best among CNN-based methods and comparable to larger transformer-based models on the KITTI Road dataset, while running at 163.79 FPS on RTX 4060 Ti and 22.18 FPS on Jetson Orin NX.
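MaxF, the headline metric here, is commonly defined as the maximum F1 score obtained when sweeping the confidence threshold applied to the per-pixel road probabilities. The sketch below illustrates that definition on a toy list of pixels. The probabilities, labels, and the 101-step threshold grid are assumptions for illustration, and the bird's-eye-view transform used by the official KITTI Road evaluation is omitted.

```python
def f1_at_threshold(probs, labels, thr):
    """F1 score when pixels with probability >= thr are predicted as road."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= thr and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < thr and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def max_f(probs, labels, steps=100):
    """Maximum F1 over an evenly spaced grid of thresholds in [0, 1]."""
    thresholds = [i / steps for i in range(steps + 1)]
    return max(f1_at_threshold(probs, labels, t) for t in thresholds)


# Toy per-pixel road probabilities and ground-truth labels (assumed data).
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
print(round(max_f(probs, labels), 4))  # 0.8571
```

Because the threshold is chosen to maximize F1 on the evaluated data, MaxF is an upper bound on the F1 a deployed model would reach with a fixed threshold.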

What carries the argument

The Multi-Scale Feature Fusion Module for cross-modal interaction at multiple scales together with the large-kernel-bridge module for efficient long-range dependency capture.
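The Figure 3 caption describes the last stage of the MSFM as adaptive gated feature fusion. The paper's exact formulation is not visible on this page, so the sketch below only shows the generic idea of gating: a per-element gate in (0, 1) decides how much of each modality passes through. The function name `gated_fusion` and the scalar gate weights are hypothetical; a trained module would learn the gate, typically with convolutions over both feature maps.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb_feat, lidar_feat, w_rgb=1.0, w_lidar=-1.0, bias=0.0):
    """Blend two equally sized feature vectors with a per-element gate.

    The gate weights are hypothetical scalars chosen for illustration only.
    """
    fused = []
    for r, l in zip(rgb_feat, lidar_feat):
        gate = sigmoid(w_rgb * r + w_lidar * l + bias)  # value in (0, 1)
        fused.append(gate * r + (1.0 - gate) * l)       # convex combination
    return fused


rgb = [0.2, 1.5, -0.3]    # toy RGB-stream features (assumed)
lidar = [0.8, 0.1, -0.3]  # toy LiDAR-stream features (assumed)
fused = gated_fusion(rgb, lidar)
print(fused)
```

Each fused value lies between the two inputs, so the gate can favour one modality without amplifying either.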

If this is right

The model supports real-time road segmentation on resource-constrained embedded platforms such as the Jetson Orin NX for autonomous driving.
CNN-based designs can compete with transformer-based models in accuracy for this task without high computational costs.
The approach validates practical deployment of lightweight multi-modal networks in intelligent robotic systems and real-world applications.
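For judging the real-time claim, it can help to restate the reported throughput as per-frame latency. The short sketch below converts the two FPS figures quoted above into milliseconds per frame; only the FPS values come from the paper, and the helper name `latency_ms` is made up for this example.

```python
def latency_ms(fps):
    """Average time per frame in milliseconds for a given frame rate."""
    return 1000.0 / fps


# Model-only inference speeds reported in the paper.
reported_fps = {"RTX 4060 Ti": 163.79, "Jetson Orin NX": 22.18}

for device, fps in reported_fps.items():
    print(f"{device}: {latency_ms(fps):.2f} ms per frame")
# RTX 4060 Ti: 6.11 ms per frame
# Jetson Orin NX: 45.09 ms per frame
```

These are model-only numbers, so sensor preprocessing and any post-processing would add to the end-to-end latency.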

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar lightweight fusion modules could be tested on other multi-modal perception tasks such as object detection or semantic segmentation in varied environments.
The linear complexity of the large-kernel module may allow the network to scale to higher-resolution inputs or video streams with limited additional cost.
Evaluating the same architecture on datasets that include adverse weather or different sensor calibrations would clarify robustness beyond the KITTI Road benchmark.

Load-bearing premise

The accuracy-efficiency balance on the KITTI Road dataset results from the specific designs of the Multi-Scale Feature Fusion Module and large-kernel-bridge module rather than from training details or dataset properties.

What would settle it

An ablation experiment that removes the Multi-Scale Feature Fusion Module and large-kernel-bridge module and records a substantial drop in MaxF score below 96% while keeping training and data the same would show whether those modules drive the reported tradeoff.

Figures

Figures reproduced from arXiv: 2605.21007 by Bingtao Wang, Daojie Peng, Fulong Ma, Jun Ma, Liang Zhang.

**Figure 1.** Overall Architecture of LiteViLNet. The network consists of a dual-stream lightweight encoder, a multi-scale feature fusion module, a large-kernel-bridge module, and a decoder with deep supervision. The RGB stream uses a pre-trained MobileNetV3-Large [23] backbone, and the LiDAR stream uses a tiny encoder based on depth-wise separable convolutions. This allows us to extract multi-scale features from both m…

**Figure 2.** Illustration of the ADI Generation Pipeline. This process converts the raw 3D LiDAR point cloud into a 2D geometric feature map, which encodes the local height difference between the ground plane and obstacles to provide strong geometric cues for road segmentation.

**Figure 3.** Overall Architecture of the Proposed MSFM. It sequentially conducts channel dimension compression, intra-modal feature enhancement via ECA and coordinate attention, bidirectional cross-modal attention interaction, and adaptive gated feature fusion to effectively integrate complementary RGB texture and LiDAR geometric information at individual feature scales.

**Figure 4.** Qualitative Segmentation Results on the KITTI Road Validation Set. Each row shows (a) the input RGB image, (b) the corresponding Altitude Difference Image (ADI) derived from LiDAR depth data, (c) the segmentation prediction of LiteViLNet, and (d) the error map visualizing true positives (TP, green), false positives (FP, red), and false negatives (FN, blue). Quantitative metrics including F1-score and IoU a…

**Figure 5.** Real-world Deployment on Different Robots. LEFT: Kuafu Delivery Vehicle, MIDDLE: Unitree-B2, RIGHT: Unitree-G1. Left column of each case shows the first-person perception pipeline of LiteViLNet: RGB image, depth map, drivable area segmentation mask, and walkable confidence heatmap. Right column shows the robot navigating autonomously using our lightweight perception system.

**Figure 6.** First-person Perception Pipeline of LiteViLNet on the Kuafu Delivery Vehicle. The panels show: (a) raw RGB image from the Orbbec Gemini 336L camera, (b) corresponding depth map, (c) drivable area segmentation mask, and (d) walkable confidence heatmap overlaid with the planned robot trajectory. The bottom legend indicates the robot running path, demonstrating that LiteViLNet provides stable and accurate ro…
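The error maps in Figure 4 color each pixel as a true positive, false positive, or false negative, and the per-image F1-score and IoU follow directly from those counts. A minimal sketch of that bookkeeping on binary masks is shown below; the 2×4 masks are toy data and the function names are made up for this example.

```python
def error_map(pred, gt):
    """Label each pixel as TP, FP, FN, or TN from two binary masks."""
    names = {(1, 1): "TP", (1, 0): "FP", (0, 1): "FN", (0, 0): "TN"}
    return [[names[(p, g)] for p, g in zip(prow, grow)]
            for prow, grow in zip(pred, gt)]


def f1_and_iou(pred, gt):
    """F1 = 2TP / (2TP + FP + FN) and IoU = TP / (TP + FP + FN)."""
    flat = [cell for row in error_map(pred, gt) for cell in row]
    tp, fp, fn = flat.count("TP"), flat.count("FP"), flat.count("FN")
    total = tp + fp + fn
    f1 = 2 * tp / (2 * tp + fp + fn) if total else 0.0
    iou = tp / total if total else 0.0
    return f1, iou


# Toy predicted and ground-truth road masks (assumed data).
pred = [[0, 1, 1, 0],
        [1, 1, 1, 0]]
gt = [[0, 1, 1, 1],
      [0, 1, 1, 0]]

f1, iou = f1_and_iou(pred, gt)
print(error_map(pred, gt)[1][0], round(f1, 4), round(iou, 4))  # FP 0.8 0.6667
```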

Abstract

Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their enormous computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose LiteViLNet, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder and depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36% MaxF score, ranking the best among all CNN-based methods and being comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on RTX 4060 Ti (22.18 FPS on Jetson Orin NX). It outperforms numerous heavy-weight methods in inference speed while maintaining highly competitive accuracy, fully validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LiteViLNet delivers a practical lightweight vision-LiDAR fusion model for road segmentation with solid efficiency numbers on KITTI, but the gains may partly trace to training details rather than the new modules alone.

This paper introduces LiteViLNet, a dual-stream lightweight network that fuses RGB and LiDAR for road segmentation. It reports 96.36% MaxF with 14M parameters and runs at 163 FPS on an RTX 4060 Ti or 22 FPS on Jetson Orin NX, which is the kind of result that matters for real-time edge deployment in autonomous driving.

The central contribution is the concrete architecture: a depth-wise separable encoder pair, the Multi-Scale Feature Fusion Module for cross-modal interaction at multiple scales, and the large-kernel-bridge for long-range context at linear cost. These choices keep the model small while staying competitive with heavier transformer baselines on the standard KITTI Road benchmark. The focus on practical metrics like MaxF and FPS, plus the embedded hardware numbers, gives the work clear applied value. The authors stay within the established lightweight CNN fusion line rather than claiming a fundamental shift, and that framing matches what they actually deliver.

The main soft spot is experimental control. If the comparison table pulls MaxF numbers straight from prior papers without re-training those baselines under identical epochs, learning rate, resolution, and test split, then the ranking could partly reflect protocol differences instead of the MSFM or bridge modules. Small setup changes often move MaxF by 1-2 points on this dataset, so matched re-implementations would tighten the claim. Ablations isolating each new component are also missing from the visible description, which leaves the source of the accuracy-efficiency balance less clear than it could be. The evaluation uses an external public dataset and standard metrics, so there is no circularity issue.

This paper is for engineers and researchers who need a deployable multi-modal road segmentation model that runs on constrained hardware without heavy compute. A reader looking for implementation ideas and real-world speed-accuracy tradeoffs will get direct use from the design. It is coherent enough on its own terms to deserve peer review, even if the current evidence for the modules as the primary driver is only moderate. I would send it to referees and ask specifically for matched baseline runs and module ablations.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LiteViLNet, a lightweight dual-stream CNN for RGB-LiDAR fusion in road segmentation. It employs depth-wise separable convolutions in the encoders, a Multi-Scale Feature Fusion Module (MSFM) for cross-modal interaction at multiple levels, and a large-kernel-bridge module for long-range dependencies with linear complexity. On the KITTI Road benchmark, the model with 14.04M parameters is reported to achieve 96.36% MaxF (best among CNN-based methods, comparable to larger transformers) while running at 163.79 FPS on RTX 4060 Ti and 22.18 FPS on Jetson Orin NX.

Significance. If the performance gains can be shown to stem from the proposed MSFM and large-kernel-bridge rather than training-protocol differences, the work would offer a practically significant advance for real-time multi-modal perception on edge devices, demonstrating that carefully designed lightweight CNNs can close much of the accuracy gap with heavier transformer models without prohibitive compute.

major comments (2)

[Experiments] Experiments section: no ablation studies are presented that isolate the contribution of the MSFM or large-kernel-bridge module (e.g., by removing each and re-training under identical conditions). Without these, it is impossible to verify that the 96.36% MaxF and efficiency balance arise from the architectural innovations rather than optimizer, augmentation, or schedule choices.
[§4] Comparison table (presumably Table 1 or equivalent in §4): MaxF and FPS numbers for prior CNN and transformer methods are taken directly from the original publications without re-implementation under a matched protocol (identical epochs, learning-rate schedule, input resolution, and test split). This leaves open the possibility that reported gaps are explained by experimental-setup differences rather than the dual-stream encoder + MSFM design.

minor comments (2)

[Abstract] The abstract states results on 'real-world applications' but the main text should explicitly indicate whether these are only qualitative visualizations or include quantitative metrics on additional datasets.
[Method] Notation for the large-kernel-bridge module should be clarified (e.g., explicit definition of kernel size, dilation, and how linear complexity is obtained) to aid reproducibility.
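The request to spell out how linear complexity is obtained can be illustrated with a rough operation count. The page does not show the paper's actual bridge design, so the sketch below assumes, purely for illustration, that the bridge behaves like a depth-wise convolution with a fixed kernel size and compares its growth with global self-attention as the feature map is enlarged. The channel count, kernel size, and feature-map sizes are hypothetical.

```python
def depthwise_conv_macs(h, w, channels, k):
    """Multiply-accumulates for a depth-wise k x k convolution (stride 1, same padding)."""
    return h * w * channels * k * k


def global_attention_macs(h, w, channels):
    """Multiply-accumulates for QK^T plus the weighted sum over V, ignoring projections."""
    n = h * w
    return 2 * n * n * channels


channels, k = 96, 13           # hypothetical values, not from the paper
small, large = (12, 40), (24, 80)  # feature maps; the second has 4x the positions

conv_growth = depthwise_conv_macs(*large, channels, k) / depthwise_conv_macs(*small, channels, k)
attn_growth = global_attention_macs(*large, channels) / global_attention_macs(*small, channels)
print(conv_growth, attn_growth)  # 4.0 16.0
```

Under this assumption the convolutional cost scales with the number of spatial positions, while attention scales with its square; whether the paper's module follows this pattern is exactly what the referee asks the authors to state.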

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the manuscript. We address the major comments point by point below and outline the revisions we will make.

Referee: [Experiments] Experiments section: no ablation studies are presented that isolate the contribution of the MSFM or large-kernel-bridge module (e.g., by removing each and re-training under identical conditions). Without these, it is impossible to verify that the 96.36% MaxF and efficiency balance arise from the architectural innovations rather than optimizer, augmentation, or schedule choices.

Authors: We fully agree with this observation. The current manuscript lacks explicit ablation studies to isolate the effects of the MSFM and large-kernel-bridge modules. To address this, we will conduct and include new ablation experiments in the revised version. Specifically, we will train variants without MSFM and without the large-kernel-bridge under the exact same training protocol, hyperparameters, and data augmentations as the full model. These results will be added to the Experiments section to demonstrate the contribution of each component.

revision: yes

Referee: [§4] Comparison table (presumably Table 1 or equivalent in §4): MaxF and FPS numbers for prior CNN and transformer methods are taken directly from the original publications without re-implementation under a matched protocol (identical epochs, learning-rate schedule, input resolution, and test split). This leaves open the possibility that reported gaps are explained by experimental-setup differences rather than the dual-stream encoder + MSFM design.

Authors: This is a valid point regarding the comparability of results. While we reported the numbers from the original papers as is common in the literature to avoid the prohibitive cost of re-implementing every method, we recognize that differences in training setups could influence the outcomes. In the revised manuscript, we will include a dedicated paragraph in the discussion or experiments section acknowledging these potential discrepancies and noting that all methods are evaluated on the same KITTI Road test set with standard metrics. Additionally, we will attempt to re-implement and re-train one or two representative methods under our protocol if resources permit, or at minimum provide more details on the training configurations used in the original works for better context.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external KITTI evaluation

The paper introduces a dual-stream lightweight encoder, Multi-Scale Feature Fusion Module (MSFM), and large-kernel-bridge module as explicit architectural proposals, then measures their effect via standard MaxF and FPS on the public KITTI Road dataset. These performance numbers (96.36% MaxF, 14.04M parameters, 163.79 FPS) are direct experimental outputs under fixed protocols, not quantities derived by construction from the modules themselves or from any fitted parameter that is later relabeled as a prediction. Cited baseline numbers from prior CNN and transformer papers are externally reproducible on the same public benchmark and therefore constitute independent evidence rather than a self-citation chain that collapses the central claim. No self-definitional equations, uniqueness theorems imported from the authors' prior work, or ansatz smuggling appear in the derivation; the accuracy-efficiency balance is therefore an empirical finding, not a tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The central performance claim rests on the effectiveness of the custom fusion and bridging modules plus the assumption that KITTI results generalize; the paper introduces two new modules whose value is demonstrated only internally.

free parameters (1)

Channel counts, kernel sizes, and fusion scales in MSFM and encoders
Hand-chosen architectural hyperparameters that directly affect the reported parameter count and speed-accuracy numbers.

axioms (2)

domain assumption: The KITTI Road dataset is representative of real-world road segmentation conditions for autonomous driving.
All reported MaxF and FPS figures depend on this dataset reflecting deployment scenarios.
domain assumption: Standard CNN optimization converges to a solution whose metrics reflect the architectural contributions rather than training artifacts.
No explicit verification of training stability or multiple runs is visible in the abstract.

invented entities (2)

Multi-Scale Feature Fusion Module (MSFM) · no independent evidence
purpose: Enable cross-modal interaction at multiple feature levels between vision and LiDAR streams.
New module introduced by the paper; no independent evidence outside the reported experiments.
large-kernel-bridge module · no independent evidence
purpose: Capture long-range dependencies with linear computational complexity.
New module introduced by the paper; no independent evidence outside the reported experiments.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 1 internal anchor

[1] S. Mozaffari, O. Al-Jarrah, M. Dianati, P. Jennings, and A. Mouzakitis, “Deep learning-based vehicle behavior prediction for autonomous driving applications: A review,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 1, pp. 33–47, 2022.
[2] T. Sun et al., “Rod: Rgb-only fast and efficient off-road freespace detection,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2025, pp. 9787–9793.
[3] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss, “Rangenet++: Fast and accurate lidar semantic segmentation,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019, pp. 4213–4220.
[4] R. Fan, H. Wang, P. Cai, and M. Liu, “Sne-roadseg: Incorporating surface normal information into semantic segmentation for accurate freespace detection,” in European Conference on Computer Vision, Springer, 2020, pp. 340–356.
[5] Z. Chen, J. Zhang, and D. Tao, “Progressive LiDAR adaptation for road detection,” IEEE/CAA Journal of Automatica Sinica, vol. 6, no. 3, pp. 693–702, 2019.
[6] C. Min et al., “Orfd: A dataset and benchmark for off-road freespace detection,” in 2022 International Conference on Robotics and Automation (ICRA), IEEE, 2022, pp. 2532–2538.
[7] G. Zhao, F. Ma, W. Qi, Y. Liu, M. Liu, and J. Ma, “Curbnet: Curb detection framework based on lidar point cloud segmentation,” IEEE Transactions on Intelligent Transportation Systems, 2025.
[8] F. Ma, Y. Liu, S. Wang, J. Wu, W. Qi, and M. Liu, “Self-supervised drivable area segmentation using lidar’s depth information for autonomous driving,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2023, pp. 41–48.
[9] F. Ma, D. Peng, and J. Ma, “Annotation-free detection of drivable areas and curbs leveraging lidar point cloud maps,” arXiv preprint arXiv:2603.27553, 2026.
[10] J. Xu, Z. Xiong, and S. P. Bhattacharyya, “Pidnet: A real-time semantic segmentation network inspired by pid controllers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19529–19539.
[11] D. Peng, J. Cao, Q. Zhang, and J. Ma, “Lovon: Legged open-vocabulary object navigator,” arXiv preprint arXiv:2507.06747, 2025.
[12] J. Zhang, H. Liu, K. Yang, X. Hu, R. Liu, and R. Stiefelhagen, “Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 12, pp. 14679–14694, 2023.
[13] Y. Feng et al., “Sne-roadsegv2: Advancing heterogeneous feature fusion and fallibility awareness for freespace detection,” IEEE Transactions on Instrumentation and Measurement, vol. 74, pp. 1–9, 2025.
[14] J. Li, Y. Zhang, P. Yun, G. Zhou, Q. Chen, and R. Fan, “Roadformer: Duplex transformer for rgb-normal semantic road scene parsing,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 7, pp. 5163–5172, 2024.
[15] F. Ma, P. Hou, Y. Liu, Y. Liu, M. Liu, and J. Ma, “Annotation-free curb detection leveraging altitude difference image,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2025, pp. 762–768.
[16] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9992–10002.
[17] Q.-H. Che, D.-T. Le, M.-Q. Pham, V.-T. Nguyen, and D.-K. Lam, “Twinlitenet+: An enhanced multi-task segmentation model for autonomous driving,” Computers and Electrical Engineering, vol. 128, p. 110694, 2025.
[18] M. Li, J. Wang, and H. Chen, “Knowledge generation and distillation for road segmentation in intelligent transportation systems,” IEEE Transactions on Intelligent Transportation Systems, 2025.
[19] A. A. Khan, J. Shao, Y. Rao, L. She, and H. T. Shen, “Lrdnet: Lightweight lidar aided cascaded feature pools for free road space detection,” IEEE Transactions on Multimedia, vol. 27, pp. 652–664, 2025.
[20] Y. Chang, F. Xue, F. Sheng, W. Liang, and A. Ming, “Fast road segmentation via uncertainty-aware symmetric network,” in 2022 International Conference on Robotics and Automation (ICRA), IEEE, 2022, pp. 11124–11130.
[21] Y. Cao and H. Qu, “Sdfnet for real-time semantic segmentation on urban road images,” IAENG International Journal of Computer Science, vol. 52, no. 12, pp. 4815–4821, 2025.
[22] Y. Duan et al., “Lcire-net: Lightweight cross-modal information interaction for road feature extraction from remote sensing images and gps trajectory/lidar,” IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–18, 2025.
[23] A. Howard et al., “Searching for mobilenetv3,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1314–1324.
[24] J. M. Alvarez and A. M. Lopez, “Road detection based on illuminant invariance,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 1, pp. 184–193, 2011.
[25] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[26] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[27] F. Wulff, B. Schaufele, O. Sawade, D. Becker, B. Henke, and I. Radusch, “Early fusion of camera and lidar for robust road detection based on u-net fcn,” in 2018 IEEE Intelligent Vehicles Symposium (IV), IEEE, 2018, pp. 1426–1431.
[28] B. Zhou and P. Krahenbuhl, “Cross-view transformers for real-time map-view semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13750–13759.
[29] A. G. Howard et al., “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[30] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 122–138.
[31] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, “Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 561–580.
[32] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 334–349.
[33] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Eca-net: Efficient channel attention for deep convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11531–11539.
[34] Q. Hou, D. Zhou, and J. Feng, “Coordinate attention for efficient mobile network design,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13708–13717.
[35] W. Wang et al., “Internimage: Exploring large-scale vision foundation models with deformable convolutions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14408–14419.
[36] M. Berman, A. R. Triki, and M. B. Blaschko, “The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4413–4421.
[37] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2999–3007.
[38] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.

work page arXiv 2025

[12] [12]

Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers,

J. Zhang, H. Liu, K. Yang, X. Hu, R. Liu, and R. Stiefelhagen, “Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers,”IEEE Transactions on Intelligent Trans- portation Systems, vol. 24, no. 12, pp. 14 679–14 694, 2023

work page 2023

[13] [13]

Sne-roadsegv2: Advancing heterogeneous feature fusion and fallibility awareness for freespace detec- tion,

Y . Feng et al., “Sne-roadsegv2: Advancing heterogeneous feature fusion and fallibility awareness for freespace detec- tion,”IEEE Transactions on Instrumentation and Measure- ment, vol. 74, pp. 1–9, 2025

work page 2025

[14] [14]

Roadformer: Duplex transformer for rgb-normal semantic road scene parsing,

J. Li, Y . Zhang, P. Yun, G. Zhou, Q. Chen, and R. Fan, “Roadformer: Duplex transformer for rgb-normal semantic road scene parsing,”IEEE Transactions on Intelligent V ehicles, vol. 9, no. 7, pp. 5163–5172, 2024

work page 2024

[15] [15]

Annotation- free curb detection leveraging altitude difference image,

F. Ma, P. Hou, Y . Liu, Y . Liu, M. Liu, and J. Ma, “Annotation- free curb detection leveraging altitude difference image,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2025, pp. 762–768

work page 2025

[16] [16]

Swin transformer: Hierarchical vision trans- former using shifted windows,

Z. Liu et al., “Swin transformer: Hierarchical vision trans- former using shifted windows,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9992–10 002

work page 2021

[17] [17]

Twinlitenet+: An enhanced multi-task segmentation model for autonomous driving,

Q.-H. Che, D.-T. Le, M.-Q. Pham, V .-T. Nguyen, and D.-K. Lam, “Twinlitenet+: An enhanced multi-task segmentation model for autonomous driving,”Computers and Electrical Engineering, vol. 128, p. 110 694, 2025

work page 2025

[18] [18]

Knowledge generation and distillation for road segmentation in intelligent transportation systems,

M. Li, J. Wang, and H. Chen, “Knowledge generation and distillation for road segmentation in intelligent transportation systems,”IEEE Transactions on Intelligent Transportation Systems, 2025

work page 2025

[19] [19]

Lrdnet: Lightweight lidar aided cascaded feature pools for free road space detection,

A. A. Khan, J. Shao, Y . Rao, L. She, and H. T. Shen, “Lrdnet: Lightweight lidar aided cascaded feature pools for free road space detection,”IEEE Transactions on Multimedia, vol. 27, pp. 652–664, 2025

work page 2025

[20] [20]

Fast road segmentation via uncertainty-aware symmetric network,

Y . Chang, F. Xue, F. Sheng, W. Liang, and A. Ming, “Fast road segmentation via uncertainty-aware symmetric network,” in2022 International Conference on Robotics and Automation (ICRA), IEEE, 2022, pp. 11 124–11 130

work page 2022

[21] [21]

Sdfnet for real-time semantic segmenta- tion on urban road images,

Y . Cao and H. Qu, “Sdfnet for real-time semantic segmenta- tion on urban road images,”IAENG International Journal of Computer Science, vol. 52, no. 12, pp. 4815–4821, 2025

work page 2025

[22] [22]

Lcire-net: Lightweight cross-modal infor- mation interaction for road feature extraction from remote sensing images and gps trajectory/lidar,

Y . Duan et al., “Lcire-net: Lightweight cross-modal infor- mation interaction for road feature extraction from remote sensing images and gps trajectory/lidar,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–18, 2025

work page 2025

[23] [23]

Searching for mobilenetv3,

A. Howard et al., “Searching for mobilenetv3,” inProceed- ings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1314–1324

work page 2019

[24] [24]

Road detection based on illuminant invariance,

J. M. Alvarez and A. M. Lopez, “Road detection based on illuminant invariance,”IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 1, pp. 184–193, 2011

work page 2011

[25] [25]

Fully convolutional networks for semantic segmentation,

J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, 2015, pp. 3431–3440

work page 2015

[26] [26]

U-net: Convolutional networks for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” inInternational Conference on Medical Image Computing and Computer- Assisted Intervention, Springer, 2015, pp. 234–241

work page 2015

[27] [27]

Early fusion of camera and lidar for robust road detection based on u-net fcn,

F. Wulff, B. Schaufele, O. Sawade, D. Becker, B. Henke, and I. Radusch, “Early fusion of camera and lidar for robust road detection based on u-net fcn,” in2018 IEEE Intelligent V ehicles Symposium (IV), IEEE, 2018, pp. 1426–1431

work page 2018

[28] [28]

Cross-view transformers for real-time map-view semantic segmentation,

B. Zhou and P. Krahenbuhl, “Cross-view transformers for real-time map-view semantic segmentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 750–13 759

work page 2022

[29] [29]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

A. G. Howard et al., “Mobilenets: Efficient convolutional neural networks for mobile vision applications,”arXiv preprint arXiv:1704.04861, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[30] [30]

Shufflenet v2: Practical guidelines for efficient cnn architecture design,

N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 122–138

work page 2018

[31] [31]

Espnet: Efficient spatial pyramid of dilated con- volutions for semantic segmentation,

S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Ha- jishirzi, “Espnet: Efficient spatial pyramid of dilated con- volutions for semantic segmentation,” inProceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 561–580

work page 2018

[32] [32]

Bisenet: Bilateral segmentation network for real-time seman- tic segmentation,

C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time seman- tic segmentation,” inProceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 334–349

work page 2018

[33] [33]

Eca- net: Efficient channel attention for deep convolutional neural networks,

Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Eca- net: Efficient channel attention for deep convolutional neural networks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 531– 11 539

work page 2020

[34] [34]

Coordinate attention for effi- cient mobile network design,

Q. Hou, D. Zhou, and J. Feng, “Coordinate attention for effi- cient mobile network design,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13 708–13 717

work page 2021

[35] [35]

Internimage: Exploring large-scale vision foundation models with deformable convolutions,

W. Wang et al., “Internimage: Exploring large-scale vision foundation models with deformable convolutions,” inProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 408–14 419

work page 2023

[36] [36]

The lov ´asz- softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,

M. Berman, A. R. Triki, and M. B. Blaschko, “The lov ´asz- softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4413–4421

work page 2018

[37] [37]

Focal loss for dense object detection,

T.-Y . Lin, P. Goyal, R. Girshick, K. He, and P. Doll ´ar, “Focal loss for dense object detection,” inProceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2999–3007

work page 2017

[38] [38]

Vision meets robotics: The kitti dataset,

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,”The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013

work page 2013