pith. machine review for the scientific record.

arxiv: 2604.05405 · v1 · submitted 2026-04-07 · 💻 cs.CV

Recognition: no theorem link

Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords weather-conditioned routing · LiDAR-radar fusion · 3D object detection · adverse weather robustness · adaptive multi-modal fusion · branch routing · condition token · K-Radar dataset

The pith

A weather-conditioned router dynamically weights pure LiDAR, pure radar, and fusion branches to adapt 3D detection to changing conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multi-modal 3D object detection can be improved by treating it as a weather-conditioned branch routing problem rather than fixed fusion. It maintains three separate feature streams and uses a condition token to let a router assign per-sample weights for aggregation. An auxiliary weather classification task plus diversity regularization keeps the branches from collapsing and makes the routing interpretable. This yields state-of-the-art results on the K-Radar dataset and transparent views of sensor preferences across weather types. Readers would care because real-world autonomous systems need reliable perception when sensors degrade unevenly in rain, fog, or snow.
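As a concreteness aid, here is a minimal PyTorch sketch of the routing step the pith describes: three branch feature maps softly aggregated with per-sample weights predicted from a condition token. The module names, feature shapes, and two-layer MLP router are illustrative assumptions, not the authors' implementation (their code has not yet been released).

    import torch
    import torch.nn as nn

    class BranchRouter(nn.Module):
        """Predicts per-sample weights over the [LiDAR, radar, fusion] branches."""
        def __init__(self, token_dim: int, num_branches: int = 3):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(token_dim, token_dim),
                nn.ReLU(inplace=True),
                nn.Linear(token_dim, num_branches),
            )

        def forward(self, condition_token: torch.Tensor) -> torch.Tensor:
            # condition_token: (B, token_dim) -> branch weights: (B, 3), each row sums to 1
            return torch.softmax(self.mlp(condition_token), dim=-1)

    def route_features(lidar_feat, radar_feat, fusion_feat, weights):
        """Softly aggregate three BEV feature maps with per-sample branch weights."""
        # each feat: (B, C, H, W); weights: (B, 3)
        stacked = torch.stack([lidar_feat, radar_feat, fusion_feat], dim=1)  # (B, 3, C, H, W)
        return (weights.view(-1, 3, 1, 1, 1) * stacked).sum(dim=1)          # (B, C, H, W)

    # toy shapes only; real branch features would come from LiDAR/radar backbones
    B, C, H, W, D = 2, 64, 128, 128, 256
    router = BranchRouter(token_dim=D)
    w = router(torch.randn(B, D))
    fused = route_features(torch.randn(B, C, H, W), torch.randn(B, C, H, W),
                           torch.randn(B, C, H, W), w)

The fused map would then feed a standard 3D detection head; only the aggregation, not the backbones, is sketched here.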

Core claim

The central claim is that reformulating perception as weather-conditioned branch routing, with parallel LiDAR, 4D radar, and condition-gated fusion streams aggregated by a lightweight router driven by a condition token from visual and semantic prompts, enables robust adaptation without branch collapse when trained with weather-supervised auxiliary classification and diversity regularization. The routing weights also give explicit insight into modality shifts, and the method outperforms prior fixed or weakly adaptive fusion pipelines.

What carries the argument

The condition-gated router that predicts sample-specific weights for the three parallel 3D feature streams using a condition token extracted from visual and semantic prompts.

Load-bearing premise

A condition token derived from visual and semantic prompts suffices for a lightweight router to predict effective sample-specific weights that avoid branch collapse when combined with weather supervision.
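A sketch of one plausible way such a token could be assembled from visual and semantic prompts. The abstract does not name the prompt encoders, so the projections below are stand-ins (a frozen image-text model such as CLIP, which appears in the reference list, could supply the input embeddings); the dimensions and the auxiliary weather head are assumptions.

    import torch
    import torch.nn as nn

    class ConditionTokenBuilder(nn.Module):
        """Fuses visual and semantic prompt embeddings into a single condition token."""
        def __init__(self, img_dim: int, txt_dim: int, token_dim: int, num_weather: int):
            super().__init__()
            self.visual_proj = nn.Linear(img_dim, token_dim)    # stand-in for a visual prompt encoder
            self.semantic_proj = nn.Linear(txt_dim, token_dim)  # stand-in for a text/label prompt encoder
            self.fuse = nn.Sequential(nn.Linear(2 * token_dim, token_dim), nn.ReLU(inplace=True))
            # auxiliary weather classifier: supervision that keeps the token weather-aware
            self.weather_head = nn.Linear(token_dim, num_weather)

        def forward(self, img_embed: torch.Tensor, prompt_embed: torch.Tensor):
            v = self.visual_proj(img_embed)        # (B, token_dim)
            s = self.semantic_proj(prompt_embed)   # (B, token_dim)
            token = self.fuse(torch.cat([v, s], dim=-1))
            return token, self.weather_head(token)  # token feeds the router; logits feed the aux loss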

What would settle it

A run on K-Radar heavy-fog test scenes in which the router fails to increase the radar-branch weight relative to LiDAR and yields no accuracy gain over a static fusion baseline.
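That test can be phrased as a short check over logged per-sample routing weights and detection scores. A hedged sketch, assuming the evaluation harness records branch weights and weather labels per scene; the field names, the clear-weather reference split, and the AP margin are illustrative.

    from statistics import mean

    def claim_refuted(samples, static_fusion_ap, routed_ap, ap_margin=0.5):
        """samples: dicts with 'weather' and router weights 'w_lidar', 'w_radar', 'w_fusion'.
        Returns True if fog does NOT shift weight toward radar and routing buys no accuracy."""
        fog = [s for s in samples if s["weather"] == "fog"]      # assumed non-empty
        clear = [s for s in samples if s["weather"] == "clear"]  # assumed non-empty
        radar_shift = mean(s["w_radar"] for s in fog) - mean(s["w_radar"] for s in clear)
        ap_gain = routed_ap - static_fusion_ap
        return radar_shift <= 0.0 and ap_gain <= ap_margin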

Figures

Figures reproduced from arXiv: 2604.05405 by Hongsheng Li, Liang Li, Lingfeng Zhang, Rong Yin, Wenbo Ding, Xiaoshuai Hao, Zexian Yang.

Figure 1. Comparison between existing LiDAR-4D radar fusion methods and our proposed approach.
Figure 2. Framework overview of the proposed weather-conditioned branch routing method.
Figure 3. Qualitative results and interpretable routing behavior.
read the original abstract

Robust 3D object detection in adverse weather is highly challenging due to the varying reliability of different sensors. While existing LiDAR-4D radar fusion methods improve robustness, they predominantly rely on fixed or weakly adaptive pipelines, failing to dy-namically adjust modality preferences as environmental conditions change. To bridge this gap, we reformulate multi-modal perception as a weather-conditioned branch routing problem. Instead of computing a single fused output, our framework explicitly maintains three parallel 3D feature streams: a pure LiDAR branch, a pure 4D radar branch, and a condition-gated fusion branch. Guided by a condition token extracted from visual and semantic prompts, a lightweight router dynamically predicts sample-specific weights to softly aggregate these representations. Furthermore, to prevent branch collapse, we introduce a weather-supervised learning strategy with auxiliary classification and diversity regularization to enforce distinct, condition-dependent routing behaviors. Extensive experiments on the K-Radar benchmark demonstrate that our method achieves state-of-the-art performance. Furthermore, it provides explicit and highly interpretable insights into modality preferences, transparently revealing how adaptive routing robustly shifts reliance between LiDAR and 4D radar across diverse adverse-weather scenarios. The source code with be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes reformulating multi-modal 3D object detection as a weather-conditioned branch routing task for LiDAR and 4D radar fusion in adverse conditions. It maintains three parallel streams (pure LiDAR, pure 4D radar, condition-gated fusion) whose outputs are softly aggregated by sample-specific weights predicted by a lightweight router. The router is driven by a condition token extracted from visual and semantic prompts; weather-supervised auxiliary classification and diversity regularization are added to enforce distinct, non-collapsing routing behaviors. Experiments on the K-Radar benchmark are said to yield state-of-the-art detection performance together with interpretable modality-preference shifts across weather regimes.

Significance. If the reported gains and the claimed interpretability hold under scrutiny, the explicit three-branch design with auxiliary regularization offers a concrete mechanism for adaptive, transparent modality selection that fixed-fusion baselines lack. The emphasis on preventing branch collapse via diversity losses is a constructive technical choice. However, the overall significance is limited by the absence of visible quantitative support in the abstract and by the unresolved dependence on visual prompts in the target domain.

major comments (2)
  1. [Method (condition token and router)] The central routing mechanism relies on a condition token extracted from visual and semantic prompts (abstract and method description). In the adverse-weather regimes that constitute the target domain, camera images are degraded by rain, fog, or snow; any corruption of this token therefore directly undermines the router's ability to produce meaningful, condition-dependent weights. Because the weather-supervised auxiliary losses and diversity regularization act downstream of token extraction, they cannot retroactively correct an uninformative or weather-agnostic token. The manuscript must demonstrate either that the token remains robust under realistic visual degradation or that an alternative non-visual conditioning path is available.
  2. [Experiments] The abstract asserts state-of-the-art performance on K-Radar yet supplies no numerical results, baseline comparisons, per-weather ablations, or error analysis. Without these data it is impossible to judge whether the routing actually delivers the claimed gains or merely matches existing fusion pipelines. The full paper must include quantitative tables (e.g., mAP, NDS, or recall stratified by weather type) together with ablations that isolate the contribution of the router, the three branches, and the auxiliary losses.
minor comments (2)
  1. [Abstract] Abstract contains the typo 'The source code with be released' (should read 'will be released').
  2. [Abstract] Abstract shows an apparent line-break hyphen: 'dy-namically' should be rendered as 'dynamically'.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Method (condition token and router)] The central routing mechanism relies on a condition token extracted from visual and semantic prompts (abstract and method description). In the adverse-weather regimes that constitute the target domain, camera images are degraded by rain, fog, or snow; any corruption of this token therefore directly undermines the router's ability to produce meaningful, condition-dependent weights. Because the weather-supervised auxiliary losses and diversity regularization act downstream of token extraction, they cannot retroactively correct an uninformative or weather-agnostic token. The manuscript must demonstrate either that the token remains robust under realistic visual degradation or that an alternative non-visual conditioning path is available.

    Authors: We appreciate the referee's point on potential degradation of visual prompts. Our condition token combines visual features with semantic prompts that encode weather conditions (e.g., textual or label-based descriptors such as 'heavy rain' or 'fog'), which are independent of camera image quality and can be sourced from external metadata or a lightweight non-visual classifier. We will add a new robustness subsection with experiments that artificially degrade the visual component of the prompts (simulating rain/fog corruption) and quantify the resulting routing stability and detection performance, confirming that the semantic path preserves meaningful condition-dependent weights. revision: yes

  2. Referee: [Experiments] The abstract asserts state-of-the-art performance on K-Radar yet supplies no numerical results, baseline comparisons, per-weather ablations, or error analysis. Without these data it is impossible to judge whether the routing actually delivers the claimed gains or merely matches existing fusion pipelines. The full paper must include quantitative tables (e.g., mAP, NDS, or recall stratified by weather type) together with ablations that isolate the contribution of the router, the three branches, and the auxiliary losses.

    Authors: We agree that explicit quantitative support is necessary for evaluating the claims. The full manuscript already contains Table 1 (overall mAP/NDS vs. baselines on K-Radar), Table 2 (per-weather stratified results), and Table 3 (ablations isolating the router, three branches, and auxiliary losses). We will update the abstract to report the key numerical gains (e.g., overall mAP improvement) and expand the error analysis to discuss the observed modality-preference shifts across weather regimes. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation: learned router with auxiliary losses is standard supervised training

full rationale

The paper describes extracting a condition token from visual/semantic prompts, feeding it to a lightweight router that outputs sample-specific weights for soft aggregation of three parallel branches (pure LiDAR, pure 4D radar, condition-gated fusion), and training the whole system with weather-supervised auxiliary classification plus diversity regularization to avoid collapse. This is a conventional end-to-end neural architecture and loss design; the routing weights are outputs of a learned module, not algebraically defined in terms of themselves, and no performance metric is shown to reduce to a fitted parameter by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided description. The central claims rest on empirical results on the K-Radar benchmark rather than tautological reductions.
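To make that training setup concrete, here is a minimal sketch of one plausible form of the combined objective: the detection loss plus weather-supervised auxiliary classification and a diversity term that penalizes the three branch features for becoming interchangeable. The cosine-similarity regularizer and the loss weights are assumptions, not the paper's stated formulation.

    import torch.nn.functional as F

    def total_loss(det_loss, weather_logits, weather_labels,
                   lidar_feat, radar_feat, fusion_feat,
                   lambda_cls=0.5, lambda_div=0.1):
        # auxiliary weather classification keeps the condition token informative
        cls_loss = F.cross_entropy(weather_logits, weather_labels)

        # penalize pairwise similarity between flattened branch features (anti-collapse)
        feats = [f.flatten(1) for f in (lidar_feat, radar_feat, fusion_feat)]
        div_loss = sum(F.cosine_similarity(feats[i], feats[j], dim=1).mean()
                       for i in range(3) for j in range(i + 1, 3)) / 3.0

        return det_loss + lambda_cls * cls_loss + lambda_div * div_loss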

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the assumption that weather can be reliably summarized by a condition token from visual and semantic prompts and that auxiliary losses can enforce distinct routing without additional external validation of the token quality.

axioms (1)
  • domain assumption A condition token extracted from visual and semantic prompts accurately represents weather conditions for routing decisions
    Invoked to guide the lightweight router in predicting sample-specific weights.
invented entities (1)
  • condition-gated fusion branch no independent evidence
    purpose: To provide a dynamic fusion pathway alongside pure sensor branches
    New architectural component introduced to enable weather-adaptive aggregation.

pith-pipeline@v0.9.0 · 5530 in / 1306 out tokens · 35281 ms · 2026-05-10T19:26:29.682655+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

79 extracted references · 20 canonical work pages · 1 internal anchor

  1. [1]

    Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, and Chiew-Lan Tai. 2022. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1090–1099

  2. [2]

    Mario Bijelic, Tobias Gruber, Fahim Mannan, Florian Kraus, Werner Ritter, Klaus Dietmayer, and Felix Heide. 2020. Seeing through fog without seeing fog: Deep multimodal sensor fusion in unseen adverse weather. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 11682–11692

  3. [3]

    Yujeong Chae, Hyeonseong Kim, Chang-Hwan Oh, Minseok Kim, and Kuk-Jin Yoon. 2024. LiDAR-Based All-Weather 3D Object Detection via Prompting and Distilling 4D Radar. InEuropean Conference on Computer Vision

  4. [4]

    Yujeong Chae, Hyeonseong Kim, and Kuk-Jin Yoon. 2024. Towards robust 3d object detection with lidar and 4d radar fusion in various weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15162–15172

  5. [5]

    Xuanyao Chen, Tianyuan Zhang, Yue Wang, Yilun Wang, and Hang Zhao. 2023. Futr3d: A unified sensor fusion framework for 3d detection. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition. 172–181

  6. [6]

    Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, et al. 2025. Exploring typographic visual prompts injection threats in cross-modality generation models. arXiv preprint arXiv:2503.11519(2025)

  7. [7]

    Anh The Do and Myungsik Yoo. 2022. LossDistillNet: 3D object detection in point cloud under harsh weather conditions.IEEE Access10 (2022), 84882–84893

  8. [8]

    Haoxiang Fu, Lingfeng Zhang, Hao Li, Ruibing Hu, Zhengrong Li, Guanjing Liu, Zimu Tan, Long Chen, Hangjun Ye, and Xiaoshuai Hao. 2026. SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction. arXiv preprint arXiv:2602.21589(2026)

  9. [9]

    Zeying Gong, Rong Li, Tianshuai Hu, Ronghe Qiu, Lingdong Kong, Lingfeng Zhang, Yiyi Ding, Leying Zhang, and Junwei Liang. 2025. Stairway to success: Zero-shot floor-aware object-goal navigation via llm-driven coarse-to-fine explo- ration.arXiv e-prints(2025), arXiv–2505

  10. [10]

    Ben Graham. 2015. Sparse 3D convolutional neural networks.arXiv preprint arXiv:1505.02890(2015)

  11. [11]

    Martin Hahner, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. 2021. Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather.2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021), 15263–15272

  12. [12]

    Xiaoshuai Hao, Yunfeng Diao, Mengchuan Wei, Yifan Yang, Peng Hao, Rong Yin, Hui Zhang, Weiming Li, Shu Zhao, and Yu Liu. 2025. Mapfusion: A novel bev feature fusion network for multi-modal map construction.Information Fusion 119 (2025), 103018

  13. [13]

    Xiaoshuai Hao, Lingdong Kong, Rong Yin, Pengwei Wang, Jing Zhang, Yunfeng Diao, and Shu Zhao. 2025. SafeMap: Robust HD Map Construction from In- complete Observations. InInternational Conference on Machine Learning. PMLR, 22091–22102

  14. [14]

    Xiaoshuai Hao, Ruikai Li, Hui Zhang, Dingzhe Li, Rong Yin, Sangil Jung, Seung-In Park, ByungIn Yoo, Haimei Zhao, and Jing Zhang. 2024. MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation. InEuropean Conference on Computer Vision

  15. [15]

    Xiaoshuai Hao, Guanqun Liu, Yuting Zhao, Yuheng Ji, Mengchuan Wei, Haimei Zhao, Lingdong Kong, Rong Yin, and Yu Liu. 2025. Msc-bench: Benchmark- ing and analyzing multi-sensor corruption for driving perception. In2025 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6

  16. [16]

    Xiaoshuai Hao, Huaihai Lyu, Lingfeng Zhang, Rui Liu, Dayan Wu, Jing Zhang, and Long Chen. 2026. H2R-BM: Can Leveraging Human Videos Enhance Performance and Generalizability in Robotic Bimanual Manipulation?Pattern Recognition (2026), 113637

  17. [17]

    Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yanbiao Ma, Yunfeng Diao, Ziyu Jia, Wenbo Ding, Hangjun Ye, and Long Chen. 2025. RoboAfford++: A Gen- erative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation.arXiv preprint arXiv:2511.12436(2025)

  18. [18]

    Xiaoshuai Hao, Mengchuan Wei, Yifan Yang, Haimei Zhao, Hui Zhang, Yi Zhou, Qiang Wang, Weiming Li, Lingdong Kong, and Jing Zhang. 2024. Is Your HD Map Constructor Reliable under Sensor Corruptions?. InAdvances in Neural Information Processing System

  19. [19]

    Xiaoshuai Hao, Hui Zhang, Yifan Yang, Yi Zhou, Sangil Jung, Seung-In Park, and ByungIn Yoo. 2024. Mbfusion: A new multi-modal bev feature fusion method for hd map construction. In2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 15922–15928

  20. [20]

    Xiaoshuai Hao, Yuting Zhao, Yuheng Ji, Luanyuan Dai, Peng Hao, Dingzhe Li, Shuai Cheng, and Rong Yin. 2025. What Really Matters for Robust Multi-Sensor HD Map Construction?. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 1298–1304

  21. [21]

    Xiaoshuai Hao, Lei Zhou, et al. 2025. Mimo-embodied: X-embodied foundation model technical report.arXiv preprint arXiv:2511.16518(2025)

  22. [22]

    Xiaoshuai Hao, Yi Zhu, Srikar Appalaraju, Aston Zhang, Wanqian Zhang, Bo Li, and Mu Li. 2023. Mixgen: A new multi-modal data augmentation. InProceedings of the IEEE/CVF winter conference on applications of computer vision. 379–389

  23. [23]

    Tengteng Huang, Zhe Liu, Xiwu Chen, and Xiang Bai. 2020. Epnet: Enhanc- ing point features with image semantics for 3d object detection. InEuropean conference on computer vision. Springer, 35–52

  24. [24]–[25]

    Xun Huang, Hai Wu, Xin Li, Xiaoliang Fan, Chenglu Wen, and Cheng Wang. 2024. Sunshine to rainstorm: Cross-weather knowledge distillation for robust 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 2409–2416

  26. [26]

    Xun Huang, Ziyu Xu, Hai Wu, Jinlong Wang, Qiming Xia, Yan Xia, Jonathan Li, Kyle Gao, Chenglu Wen, and Cheng Wang. 2025. L4dr: Lidar-4dradar fusion for weather-robust 3d object detection. InProceedings of the AAAI conference on artificial intelligence, Vol. 39. 3806–3814

  27. [27]

    Lingdong Kong, You-Chen Liu, Xin Li, Runnan Chen, Wenwei Zhang, Jiawei Ren, Liang Pan, Kaili Chen, and Ziwei Liu. 2023. Robo3D: Towards Robust and Reliable 3D Perception against Corruptions.2023 IEEE/CVF International Conference on Computer Vision (ICCV)(2023), 19937–19949

  28. [28]

    Lingdong Kong, Shaoyuan Xie, Zeying Gong, Ye Li, Meng Chu, Ao Liang, Yuhao Dong, Tianshuai Hu, Ronghe Qiu, Rong Li, et al. 2026. The RoboSense challenge: Sense anything, navigate anywhere, adapt across platforms. arXiv preprint arXiv:2601.05014 (2026)

  29. [29]

    Seung-Hyun Kong, Dong-Hee Paek, and Sangjae Cho. 2023. RTNH+: Enhanced 4D radar object detection network using combined CFAR-based two-level pre- processing and vertical encoding.arXiv preprint arXiv:2310.17659(2023)

  30. [30]

    Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Os- car Beijbom. 2019. Pointpillars: Fast encoders for object detection from point clouds. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 12697–12705

  31. [31]

    Dasong Li, Sizhuo Ma, Hang Hua, Wenjie Li, Jian Wang, Chris Wei Zhou, Feng- bin Guan, Xin Li, Zihao Yu, Yiting Lu, et al . 2025. Vquala 2025 challenge on engagement prediction for short videos: Methods and results. InProceedings of the IEEE/CVF International Conference on Computer Vision. 3391–3401

  32. [32]

    Xin Li, Tao Ma, Yuenan Hou, Botian Shi, Yuchen Yang, Youquan Liu, Xingjiao Wu, Qin Chen, Yikang Li, Yu Qiao, et al. 2023. Logonet: Towards accurate 3d object detection with local-to-global cross-modal fusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 17524–17534

  33. [33]

    Yu Li, Yuenan Hou, Yingmei Wei, Xinge Zhu, Yuexin Ma, Wenqi Shao, and Yanming Guo. 2025. MoE3D: Mixture of Experts meets Multi-Modal 3D Under- standing.arXiv preprint arXiv:2511.22103(2025)

  34. [34]

    Yingwei Li, Adams Wei Yu, Tianjian Meng, Ben Caine, Jiquan Ngiam, Daiyi Peng, Junyang Shen, Yifeng Lu, Denny Zhou, Quoc V Le, et al. 2022. Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 17182–17191

  35. [35]

    Peiran Liu, Qiang Zhang, Daojie Peng, Lingfeng Zhang, Yihao Qin, Hang Zhou, Jun Ma, Renjing Xu, and Yiding Ji. 2025. Toponav: Topological graphs as a key enabler for advanced object navigation.arXiv preprint arXiv:2509.01364(2025)

  36. [36]

    Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela L Rus, and Song Han. 2023. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In2023 IEEE international conference on robotics and automation (ICRA). IEEE, 2774–2781

  37. [37]

    Dong-Hee Paek, Seung-Hyun Kong, and Kevin Tirta Wijaya. 2022. K-radar: 4d radar object detection for autonomous driving in various weather conditions. Advances in Neural Information Processing Systems35 (2022), 3819–3829

  38. [38]

    Yuan Xiao Qi, Chun Liu, Hangbin Wu, Ruijie Chen, Chenglu Wen, Xun Huang, Shoujun Jia, and Keke Zhang. 2026. FusionBev: LiDAR and 4D radar fusion for 3D object detection.Inf. Fusion132 (2026), 104240

  39. [39]

    Kun Qian, Shilin Zhu, Xinyu Zhang, and Li Erran Li. 2021. Robust Multimodal Vehicle Detection in Foggy Weather Using Complementary Lidar and Radar Signals. In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 444–453

  40. [40]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InInternational conference on machine learning. PmLR, 8748–8763

  41. [41]

    Hao Shan, Ruikai Li, Han Jiang, et al. 2025. Stability under scrutiny: Benchmarking representation paradigms for online hd mapping.arXiv preprint arXiv:2510.10660 (2025)

  42. [42]

    Vishwanath A Sindagi, Yin Zhou, and Oncel Tuzel. 2019. Mvx-net: Multimodal voxelnet for 3d object detection. In2019 International Conference on Robotics and Automation (ICRA). IEEE, 7276–7282

  43. [43]

    Jingyu Song, Lingjun Zhao, and Katherine A. Skinner. 2024. LiRaFusion: Deep Adaptive LiDAR-Radar Fusion for 3D Object Detection. In2024 IEEE International Conference on Robotics and Automation (ICRA). 18250–18257

  44. [44]

    Ziying Song, Lin Liu, Feiyang Jia, Yadan Luo, Caiyan Jia, Guoxin Zhang, Lei Yang, and Li Wang. 2024. Robustness-aware 3d object detection in autonomous driving: A review and outlook.IEEE Transactions on Intelligent Transportation Systems25, 11 (2024), 15407–15436

  45. [45]

    Yingbo Tang, Lingfeng Zhang, Shuyi Zhang, Yinuo Zhao, and Xiaoshuai Hao. 2025. Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation. InProceedings of the 33rd ACM International Conference on Multimedia. 12706–12713

  46. [46]

    Sourabh Vora, Alex H Lang, Bassam Helou, and Oscar Beijbom. 2020. Pointpaint- ing: Sequential fusion for 3d object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4604–4612

  47. [47]

    Li Wang, Xinyu Zhang, Baowei Xv, Jinzhao Zhang, Rong Fu, Xiaoyu Wang, Lei Zhu, Haibing Ren, Pingping Lu, Jun Li, et al. 2022. InterFusion: Interaction-based 4D radar and LiDAR fusion for 3D object detection. In2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 12247–12253

  48. [48]

    Yingjie Wang, Jiajun Deng, Yao Li, Jinshui Hu, Cong Liu, Yu Zhang, Jianmin Ji, Wanli Ouyang, and Yanyong Zhang. 2023. Bi-lrfusion: Bi-directional lidar-radar fusion for 3d dynamic object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13394–13403

  49. [49]–[50]

    Yan Wang, Junbo Yin, Wei Li, Pascal Frossard, Ruigang Yang, and Jianbing Shen. 2023. Ssda3d: Semi-supervised domain adaptation for 3d object detection from point cloud. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 2707–2715

  51. [51]

    Philipp Wolters, Johannes Gilg, Torben Teepe, Fabian Herzog, Anouar Laouichi, Martin Hofmann, and Gerhard Rigoll. 2025. Unleashing HyDRa: Hybrid Fusion, Depth Consistency and Radar for Unified 3D Perception. In2025 IEEE International Conference on Robotics and Automation (ICRA). 7467–7474

  52. [52]

    Hongjing Wu, Cheng Chi, Jinlin Wu, Yanzhao Su, Zhen Lei, and Wenqi Ren. 2026. UniDA3D: A Unified Domain-Adaptive Framework for Multi-View 3D Object Detection. arXiv:2603.27995 [cs.CV]

  53. [53]

    Hai Wu, Chenglu Wen, Shaoshuai Shi, Xin Li, and Cheng Wang. 2023. Virtual sparse convolution for multimodal 3d object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 21653–21662

  54. [54]

    Yujie Wu, Huaihai Lyu, Yingbo Tang, Lingfeng Zhang, Zhihui Zhang, Wei Zhou, and Siqi Hao. 2025. Evaluating GPT-4o’s Embodied Intelligence: A Comprehensive Empirical Study.Authorea Preprints(2025)

  55. [55]

    Zizhang Wu, Guilian Chen, Yuanzhu Gan, Lei Wang, and Jian Pu. 2023. Mvfusion: Multi-view 3d object detection with semantic-aligned radar and camera fusion. arXiv preprint arXiv:2302.10511(2023)

  56. [56]

    Qiming Xia, Wei Ye, Hai Wu, Shijia Zhao, Leyuan Xing, Xun Huang, Jinhao Deng, Xin Li, Chenglu Wen, and Cheng Wang. 2024. Hinted: Hard instance enhanced detector with mixed-density feature fusion for sparsely-supervised 3d object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15321–15330

  57. [57]

    Erjia Xiao, Lingfeng Zhang, Yingbo Tang, Hao Cheng, Renjing Xu, Wenbo Ding, Lei Zhou, Long Chen, Hangjun Ye, and Xiaoshuai Hao. 2025. Team Xiaomi EV-AD VLA: Learning to Navigate Socially Through Proactive Risk Perception– Technical Report for IROS 2025 RoboSense Challenge Social Navigation Track. arXiv e-prints(2025), arXiv–2510

  58. [58]–[59]

    Weiyi Xiong, Jianan Liu, Tao Huang, Qing-Long Han, Yuxuan Xia, and Bing Zhu. 2023. LXL: LiDAR excluded lean 3D object detection with 4D imaging radar and camera fusion. IEEE Transactions on Intelligent Vehicles 9, 1 (2023), 79–92

  60. [60]–[61]

    Qiangeng Xu, Yin Zhou, Weiyue Wang, Charles R Qi, and Dragomir Anguelov. 2021. Spg: Unsupervised domain adaptation for 3d object detection via semantic point generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 15446–15456

  62. [62]

    Junjie Yan, Yingfei Liu, Jianjian Sun, Fan Jia, Shuailin Li, Tiancai Wang, and Xiangyu Zhang. 2023. Cross modal transformer via coordinates encoding for 3d object dectection.arXiv preprint arXiv:2301.012832, 3 (2023), 4

  63. [63]

    Jihan Yang, Shaoshuai Shi, Zhe Wang, Hongsheng Li, and Xiaojuan Qi. 2021. ST3D: Self-training for Unsupervised Domain Adaptation on 3D Object Detection. In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10363–10373

  64. [64]

    Junbo Yin, Jianbing Shen, Runnan Chen, Wei Li, Ruigang Yang, Pascal Frossard, and Wenguan Wang. 2024. Is-fusion: Instance-scene collaborative fusion for multimodal 3d object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 14905–14915

  65. [65]

    Lingfeng Zhang, Haoxiang Fu, Xiaoshuai Hao, Shuyi Zhang, Qiang Zhang, Rui Liu, Long Chen, and Wenbo Ding. 2026. What You See is What You Reach: Towards Spatial Navigation with High-Level Human Instructions. (2026)

  66. [66]

    Lingfeng Zhang, Xiaoshuai Hao, Yingbo Tang, Haoxiang Fu, Xinyu Zheng, Pengwei Wang, Zhongyuan Wang, Wenbo Ding, and Shanghang Zhang. 2025. NavA3: Understanding Any Instruction, Navigating Anywhere, Finding Anything. arXiv preprint arXiv:2508.04598 (2025)

  67. [67]–[68]

    Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, and Renjing Xu. 2025. Mapnav: A novel memory representation via annotated semantic maps for vlm-based vision-and-language navigation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 13032–13056

  69. [69]

    Lingfeng Zhang, Hao Wang, Erjia Xiao, Xinyao Zhang, Qiang Zhang, Zixuan Jiang, and Renjing Xu. 2025. Multi-floor zero-shot object navigation policy. In 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 6416–6422

  70. [70]

    Lingfeng Zhang, Erjia Xiao, Xiaoshuai Hao, Haoxiang Fu, Zeying Gong, Long Chen, Xiaojun Liang, Renjing Xu, Hangjun Ye, and Wenbo Ding. 2025. SocialNav- Map: Dynamic Mapping with Human Trajectory Prediction for Zero-Shot Social Navigation.arXiv preprint arXiv:2511.12232(2025)

  71. [71]

    Lingfeng Zhang, Erjia Xiao, Yuchen Zhang, Haoxiang Fu, Ruibin Hu, Yanbiao Ma, Wenbo Ding, Long Chen, Hangjun Ye, and Xiaoshuai Hao. 2025. Team Xiaomi EV-AD VLA: Caption-Guided Retrieval System for Cross-Modal Drone Navigation–Technical Report for IROS 2025 RoboSense Challenge Track 4.arXiv preprint arXiv:2510.02728(2025)

  72. [72]

    Lingfeng Zhang, Qiang Zhang, Hao Wang, Erjia Xiao, Zixuan Jiang, Honglei Chen, and Renjing Xu. 2024. Trihelper: Zero-shot object navigation with dynamic assistance. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 10035–10042

  73. [73]

    Lingfeng Zhang, Yuchen Zhang, Hongsheng Li, Haoxiang Fu, Yingbo Tang, Hangjun Ye, Long Chen, Xiaojun Liang, Xiaoshuai Hao, and Wenbo Ding. 2025. Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation.arXiv preprint arXiv:2511.13269(2025)

  74. [74]

    Qiang Zhang, Gang Han, Jingkai Sun, Wen Zhao, Jiahang Cao, Jiaxu Wang, Hao Cheng, Lingfeng Zhang, Yijie Guo, and Renjing Xu. 2025. Lips: Large-scale humanoid robot reinforcement learning with parallel-series structures. arXiv preprint …

  75. [75]

    Qiang Zhang, Peiran Ma, Jiahao …

  76. [76]

    Qiang Zhang, Zhang Zhang, Wei Cui, Jingkai Sun, Jiahang Cao, Yijie Guo, Gang Han, Wen Zhao, Jiaxu Wang, Chenghao Sun, et al. 2025. Humanoidpano: Hybrid spherical panoramic-lidar cross-modal perception for humanoid robots.arXiv preprint arXiv:2503.09010(2025)

  77. [77]

    Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. 2025. Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of-thought. InProceedings of the 33rd ACM International Conference on Multimedia. 12745–12752

  78. [78]

    Haocheng Zhao, Runwei Guan, Taoyu Wu, Ka Lok Man, Limin Yu, and Yutao Yue. 2025. Unibevfusion: Unified radar-vision bevfusion for 3d object detection. In2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 6321–6327

  79. [79]

    Xinyu Zheng, Yangfan He, Yuhao Luo, Lingfeng Zhang, Jianhui Wang, Tianyu Shi, and Yun Bai. 2025. Railway side slope hazard detection system based on generative models.IEEE Sensors Journal(2025)