pith. machine review for the scientific record.

arxiv: 2604.27499 · v1 · submitted 2026-04-30 · 💻 cs.CV

Recognition: unknown

Towards All-Day Perception for Off-Road Driving: A Large-Scale Multispectral Dataset and Comprehensive Benchmark

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 08:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords off-road driving · infrared dataset · freespace detection · temporal segmentation · memory attention · autonomous vehicles · multispectral perception

The pith

A memory-attention network trained on a new large infrared off-road dataset improves freespace detection accuracy by over 1% while running in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the IRON dataset, the first large-scale collection of 24,314 densely annotated infrared images paired with RGB for off-road freespace detection across day and night. It introduces IRONet, a flow-free temporal framework that aggregates historical context through a memory-attention mechanism to resolve inconsistencies between frames that plague single-frame methods. On the IRON benchmark this yields state-of-the-art IoU and F1 scores at real-time speeds. The same model also transfers directly to RGB images on existing off-road benchmarks, supporting more reliable all-day perception where visible light is unreliable.

Core claim

On the IRON dataset of 24,314 densely annotated infrared images with synchronized RGB, the IRONet model using memory attention and a mask decoder reaches 82.93% IoU and 90.66% F1 score for freespace detection, outperforming previous methods by 1.19% IoU and 0.71% F1 at real-time inference speeds. IRONet further shows strong generalization when applied to RGB images on the ORFD and Rellis datasets.
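To make the headline numbers concrete, here is a minimal sketch of how IoU and F1 are conventionally computed for binary freespace masks. The paper's evaluation code is not reproduced on this page, so the function below is illustrative, not the authors' implementation.

```python
import numpy as np

def freespace_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """IoU and F1 for binary freespace masks (True = drivable).

    Illustrative only: the paper's own evaluation script may differ in
    edge-case handling (empty masks, per-image vs. dataset-level averaging).
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()   # true positives
    fp = np.logical_and(pred, ~gt).sum()  # false positives
    fn = np.logical_and(~pred, gt).sum()  # false negatives
    iou = tp / (tp + fp + fn + 1e-9)
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return {"iou": iou, "f1": f1}
```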

What carries the argument

The memory-attention mechanism in IRONet that aggregates historical context from previous frames to enforce temporal consistency in freespace segmentation without optical flow.
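This page does not spell out the internals of IRONet's SGMC and HIMG modules, so the following is a generic sketch of the technique being described: current-frame feature tokens cross-attend over a FIFO bank of past-frame tokens, giving temporal aggregation with no optical-flow computation. Names, dimensions, and the eviction policy are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """Generic flow-free temporal aggregation: current-frame features
    attend over a FIFO bank of past-frame features. A sketch of the
    general technique only; IRONet's SGMC/HIMG modules are not
    specified on this page and will differ in detail."""

    def __init__(self, dim: int = 256, heads: int = 8, memory_size: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.memory_size = memory_size
        self.memory: list[torch.Tensor] = []  # past-frame feature tokens

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H*W, C) flattened per-frame feature tokens
        if self.memory:
            bank = torch.cat(self.memory, dim=1)   # (B, T*H*W, C)
            ctx, _ = self.attn(feats, bank, bank)  # query with current frame
            feats = self.norm(feats + ctx)         # residual fusion
        self.memory.append(feats.detach())         # push current frame
        if len(self.memory) > self.memory_size:
            self.memory.pop(0)                     # FIFO eviction
        return feats
```

Note the detach on stored features: without it, gradients would flow through the entire frame history, which is the usual reason such banks store detached tokens.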

If this is right

  • Temporal consistency can be added to single-frame perception models without the cost of optical-flow computation.
  • Infrared perception becomes practical for nighttime off-road autonomous driving.
  • The IRON dataset enables further development of multispectral methods for unstructured environments.
  • The memory-attention approach transfers across modalities without retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same temporal aggregation could be tested on longer sequences or fused with additional sensors to handle rapid terrain changes.
  • Methods tuned on this off-road infrared data may reveal weaknesses in models originally designed for structured on-road scenes.
  • Extending the framework to infrared object detection or depth estimation could produce similar consistency gains.

Load-bearing premise

That the reported accuracy gains arise chiefly from the memory-attention design rather than from dataset annotation quality, training choices, or the particular scenes used in the test split.

What would settle it

An independently collected infrared off-road dataset with different terrain and lighting where IRONet shows no improvement over single-frame baselines on the same metrics.

Figures

Figures reproduced from arXiv: 2604.27499 by Chen Min, Jilin Mei, Shuai Wang, Shuo Wang, Wenfei Guan, Yan Xing, Yu Hu.

Figure 1: Comparison of RGB and IR perception under nighttime …
Figure 2: Data Processing and Annotation Pipeline.
Figure 3: Samples from our IRON dataset. Each column shows a different scene, illustrating the diversity of environments and …
Figure 4: Overview of our proposed IRONet architecture. SGMC and HIMG represent the semantic-guided memory compensation …
Figure 5: Qualitative comparison of IRONet against state-of-the-art methods on a representative sequence from the IRON test set.
Original abstract

Off-road nighttime autonomous driving suffers from unreliable visible-light perception, making infrared modality crucial for accurate freespace detection. However, progress remains limited due to the scarcity of annotated infrared off-road datasets and the inter-frame inconsistencies inherent to current single-frame methods. To address these gaps, we present the IRON dataset, which, to our knowledge, is the first large-scale infrared dataset for off-road temporal freespace detection under all-day conditions, with strong support for nighttime perception. The dataset comprises 24,314 densely annotated infrared images with synchronized RGB images in diverse scenes and different light conditions. Building upon this dataset, we propose IRONet, a novel flow-free framework for temporal freespace detection that addresses inter-frame inconsistencies by aggregating historical context via a memory-attention mechanism and a carefully designed mask decoder. On our IRON dataset, IRONet achieves state-of-the-art performance, reaching 82.93%(+1.19%) IoU and 90.66%(+0.71%) F1 score at real-time inference. Remarkably, IRONet also exhibits robust generalization to RGB modalities on ORFD and Rellis datasets. Overall, our work establishes a foundation for reliable all-day off-road autonomous driving and future research in infrared temporal perception. The code and IRON dataset are available at https://github.com/wsnbws/IRON.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the IRON dataset (24,314 densely annotated infrared images paired with RGB, covering diverse off-road scenes and lighting conditions) and proposes IRONet, a flow-free temporal freespace detection network that aggregates historical context via a memory-attention mechanism and a mask decoder. It reports state-of-the-art results on IRON (82.93% IoU and 90.66% F1 at real-time inference) with claimed generalization to RGB on ORFD and Rellis datasets, positioning the work as a foundation for all-day off-road perception.

Significance. If the reported gains hold after proper controls, the work would provide a valuable large-scale infrared benchmark for off-road freespace detection and demonstrate a practical temporal architecture that improves frame-to-frame consistency without optical flow, advancing multispectral perception for autonomous driving in low-light conditions.

major comments (3)
  1. [Experiments] Experiments section: The headline claims of +1.19% IoU and +0.71% F1 over prior methods are presented without exhaustive ablations (e.g., memory-attention removed, single-frame baseline with identical backbone and training schedule, or simple frame-stacking comparator), error bars across runs, or explicit validation protocol details; this leaves open whether gains derive from the architecture or from dataset-specific factors.
  2. [Dataset] Dataset section: The IRON train/test split description does not report temporal non-overlap criteria, scene diversity statistics, or cross-validation to rule out memorization of off-road textures or lighting patterns, which is load-bearing for the generalization claims.
  3. [Experiments] Generalization experiments: The transfer results on ORFD and Rellis lack specification of the protocol (zero-shot inference vs. fine-tuning) and do not include a matched single-frame baseline, weakening the assertion that the memory-attention mechanism drives robust cross-modal performance.
minor comments (2)
  1. [Abstract] Abstract and method overview: The phrase 'flow-free framework' is used without a brief contrast to flow-based alternatives, which could be clarified for readers unfamiliar with the subfield.
  2. [Implementation] The GitHub link is provided, but the manuscript does not include a reproducibility checklist or hyperparameter table, which would aid verification of the real-time claims.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below and have made corresponding revisions to the paper.

Point-by-point responses
  1. Referee: Experiments section: The headline claims of +1.19% IoU and +0.71% F1 over prior methods are presented without exhaustive ablations (e.g., memory-attention removed, single-frame baseline with identical backbone and training schedule, or simple frame-stacking comparator), error bars across runs, or explicit validation protocol details; this leaves open whether gains derive from the architecture or from dataset-specific factors.

    Authors: We agree that the original manuscript would be strengthened by these additional controls. In the revised version, we have added an ablation study that removes the memory-attention module, a single-frame baseline using the identical backbone and training schedule, and a simple frame-stacking comparator. We now report error bars computed over three independent runs with different random seeds and provide explicit details on the train/validation/test protocol and hyperparameter settings in the Experiments section. These new results confirm that the reported gains are attributable to the proposed architecture. revision: yes

  2. Referee: Dataset section: The IRON train/test split description does not report temporal non-overlap criteria, scene diversity statistics, or cross-validation to rule out memorization of off-road textures or lighting patterns, which is load-bearing for the generalization claims.

    Authors: We acknowledge the need for greater transparency on the split. The revised Dataset section now explicitly describes the temporal non-overlap criteria (no shared frames or consecutive sequences between train and test), provides scene diversity statistics (number of distinct locations, distribution across daytime/nighttime and weather conditions), and includes a cross-validation experiment that demonstrates consistent performance across different scene partitions, thereby addressing concerns about memorization. revision: yes

  3. Referee: Generalization experiments: The transfer results on ORFD and Rellis lack specification of the protocol (zero-shot inference vs. fine-tuning) and do not include a matched single-frame baseline, weakening the assertion that the memory-attention mechanism drives robust cross-modal performance.

    Authors: We thank the referee for highlighting this omission. The revised manuscript now clearly states that the ORFD and Rellis results were obtained via zero-shot inference with no fine-tuning on the target datasets. We have also added a matched single-frame baseline (same backbone, trained only on IRON) for direct comparison on both datasets, which isolates the contribution of the memory-attention mechanism to the observed cross-modal generalization. revision: yes
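On the temporal non-overlap criteria raised in major comment 2 and response 2: a minimal sketch of the split discipline being described, which assigns whole recording sequences rather than individual frames to train or test. The (sequence_id, frame_path) layout is hypothetical, not the IRON release format.

```python
import random
from collections import defaultdict

def sequence_level_split(frames: list[tuple[str, str]],
                         test_frac: float = 0.2,
                         seed: int = 0) -> tuple[list[str], list[str]]:
    """Split by whole recording sequences so no sequence contributes
    frames to both train and test. `frames` holds (sequence_id,
    frame_path) pairs -- a hypothetical layout for illustration."""
    by_seq = defaultdict(list)
    for seq_id, path in frames:
        by_seq[seq_id].append(path)
    seq_ids = sorted(by_seq)
    random.Random(seed).shuffle(seq_ids)           # reproducible shuffle
    n_test = max(1, int(len(seq_ids) * test_frac))
    test = [p for s in seq_ids[:n_test] for p in by_seq[s]]
    train = [p for s in seq_ids[n_test:] for p in by_seq[s]]
    return train, test
```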

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark claims

full rationale

This is an empirical ML paper introducing the IRON dataset and evaluating IRONet on held-out test splits plus transfer to ORFD and Rellis. Reported IoU/F1 metrics are direct measurements from training and inference on those splits, not quantities that reduce by construction to fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are presented that loop the performance claims back to the inputs; the central results remain independent empirical observations.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The central claims rest on the accuracy of manual annotations in the new dataset, the assumption that temporal context via attention improves consistency, and standard deep-learning training assumptions. No machine-checked proofs or parameter-free derivations are present.

free parameters (1)
  • model hyperparameters and training schedule
    Standard neural network parameters tuned during development; not enumerated in abstract.
axioms (2)
  • domain assumption Dense manual annotations for freespace are accurate and consistent across frames and lighting conditions
    Required for supervised training and evaluation of the detection task.
  • domain assumption Aggregating historical context via memory attention reduces inter-frame inconsistencies better than single-frame or flow-based alternatives
    Core motivation and design choice for IRONet.
invented entities (2)
  • IRONet architecture no independent evidence
    purpose: Flow-free temporal freespace detection
    New model proposed to address the identified gaps.
  • memory-attention mechanism no independent evidence
    purpose: Aggregating historical context without optical flow
    Key component claimed to solve inter-frame inconsistency.

pith-pipeline@v0.9.0 · 5561 in / 1543 out tokens · 147410 ms · 2026-05-07T08:47:22.586818+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1] U. Shin, J. Park, and I. S. Kweon, “Deep depth estimation from thermal image,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1043–1053.
  2. [2] T. Kim, S. Shin, Y. Yu, H. G. Kim, and Y. M. Ro, “Causal mode multiplexer: A novel framework for unbiased multispectral pedestrian detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26784–26793.
  3. [3] J. Jang, C. Park, H. Kim, J. Lee, and J. Paik, “Multispectral object detection enhanced by cross-modal information complementary and cosine similarity channel resampling modules,” in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 9437–9446.
  4. [4] Y. Huang, T. Miyazaki, X. Liu, and S. Omachi, “Infrared image super-resolution: A systematic review and future trends,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025.
  5. [5] M. Yuan, B. Cui, T. Zhao, J. Wang, S. Fu, X. Yang, and X. Wei, “UniRGB-IR: A unified framework for visible-infrared semantic tasks via adapter tuning,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 2409–2418.
  6. [6] Q. Li, K. Tan, D. Yuan, and Q. Liu, “Progressive domain adaptation for thermal infrared tracking,” Electronics, vol. 14, no. 1, p. 162, 2025.
  7. [7] P. Jiang, P. Osteen, M. Wigness, and S. Saripalli, “RELLIS-3D dataset: Data, benchmarks and analysis,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 1110–1116.
  8. [8] C. Min, W. Jiang, D. Zhao, J. Xu, L. Xiao, Y. Nie, and B. Dai, “ORFD: A dataset and benchmark for off-road freespace detection,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 2532–2538.
  9. [9] P. Mortimer, R. Hagmanns, M. Granero, T. Luettel, J. Petereit, and H.-J. Wuensche, “The GOOSE dataset for perception in unstructured environments,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 14838–14844.
  10. [10] S. Sharma, L. Dabbiru, T. Hannis, G. Mason, D. W. Carruth, M. Doude, C. Goodin, C. Hudson, S. Ozier, J. E. Ball et al., “CaT: CAVS traversability dataset for off-road autonomous driving,” IEEE Access, vol. 10, pp. 24759–24768, 2022.
  11. [11] A. Datar, A. Pokhrel, M. Nazeri, M. B. Rao, C. Pan, Y. Zhang, A. Harrison, M. Wigness, P. R. Osteen, J. Ye et al., “M2P2: A multi-modal passive perception dataset for off-road mobility in extreme low-light conditions,” arXiv preprint arXiv:2410.01105, 2024.
  12. [12] J. Zhuang, Z. Wang, and J. Li, “Video semantic segmentation with inter-frame feature fusion and inner-frame feature refinement,” arXiv preprint arXiv:2301.03832, 2023.
  13. [13] S. A. S. Hesham, Y. Liu, G. Sun, H. Ding, J. Yang, E. Konukoglu, X. Geng, and X. Jiang, “Exploiting temporal state space sharing for video semantic segmentation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 24211–24221.
  14. [14] H. Wang, R. Fan, P. Cai, and M. Liu, “SNE-RoadSeg+: Rethinking depth-normal translation and deep supervision for freespace detection,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 1140–1145.
  15. [15] H. Li, Y. Chen, Q. Zhang, and D. Zhao, “BiFNet: Bidirectional fusion network for road segmentation,” IEEE Transactions on Cybernetics, vol. 52, no. 9, pp. 8617–8628, 2021.
  16. [16] J. Li, Y. Zhang, P. Yun, G. Zhou, Q. Chen, and R. Fan, “RoadFormer: Duplex transformer for RGB-normal semantic road scene parsing,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 7, pp. 5163–5172, 2024.
  17. [17] T. Sun, H. Ye, J. Mei, L. Chen, F. Zhao, L. Zong, and Y. Hu, “ROD: RGB-only fast and efficient off-road freespace detection,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 9787–9793.
  18. [18] Y. Weng, M. Han, H. He, M. Li, L. Yao, X. Chang, and B. Zhuang, “Mask propagation for efficient video semantic segmentation,” Advances in Neural Information Processing Systems, vol. 36, pp. 7170–7183, 2023.
  19. [19] V. Fedynyak, Y. Romanus, O. Dobosevych, I. Babin, and R. Riazantsev, “Global motion understanding in large-scale video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3153–3162.
  20. [20] Y. Liu, J. Yan, F. Jia, S. Li, A. Gao, T. Wang, and X. Zhang, “PETRv2: A unified framework for 3D perception from multi-camera images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3262–3272.
  21. [21] B. Xu, R. Hou, T. Ren, and G. Wu, “RGB-D video object segmentation via enhanced multi-store feature memory,” in Proceedings of the 2024 International Conference on Multimedia Retrieval, 2024, pp. 1016–1024.
  22. [22] J.-H. Baek, J. Oh, and Y. J. Koh, “Evolve: Event-guided deformable feature transfer and dual-memory refinement for low-light video object segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 11273–11282.
  23. [23] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
  24. [24] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
  25. [25] M. Wigness, S. Eum, J. G. Rogers, D. Han, and H. Kwon, “A RUGD dataset for autonomous navigation and visual perception in unstructured outdoor environments,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 5000–5007.
  26. [26] M. Sivaprakasam, P. Maheshwari, M. G. Castro, S. Triest, M. Nye, S. Willits, A. Saba, W. Wang, and S. Scherer, “TartanDrive 2.0: More modalities and better infrastructure to further self-supervised learning research in off-road driving tasks,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 12606–12606.
  27. [27] C. Min, J. Mei, H. Zhai, S. Wang, T. Sun, F. Kong, H. Li, F. Mao, F. Liu, S. Wang et al., “Advancing off-road autonomous driving: The large-scale ORAD-3D dataset and comprehensive benchmarks,” arXiv preprint arXiv:2510.16500, 2025.
  28. [28] K. Małek, J. Dybała, A. Kordecki, P. Hondra, and K. Kijania, “OffRoadSynth open dataset for semantic segmentation using synthetic-data-based weight initialization for autonomous UGV in off-road environments,” Journal of Intelligent & Robotic Systems, vol. 110, no. 2, p. 76, 2024.
  29. [29] Y. Choi, N. Kim, S. Hwang, K. Park, J. S. Yoon, K. An, and I. S. Kweon, “KAIST multi-spectral day/night data set for autonomous and assisted driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 3, pp. 934–948, 2018.
  30. [30] X. Jia, C. Zhu, M. Li, W. Tang, and W. Zhou, “LLVIP: A visible-infrared paired dataset for low-light vision,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3496–3504.
  31. [31] Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, and T. Harada, “MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 5108–5115.
  32. [32] FLIR Systems, Inc., “FLIR ADAS dataset,” Dataset, FLIR Systems, Inc., USA, accessed: 2025-10-27. [Online]. Available: https://oem.flir.com/en-in/solutions/automotive/adas-dataset-form/
  33. [33] J. Liu, X. Fan, Z. Huang, G. Wu, R. Liu, W. Zhong, and Z. Luo, “Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5802–5811.
  34. [34] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
  35. [35] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
  36. [36] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
  37. [37] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1290–1299.
  38. [38] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
  39. [39] R. Fan, H. Wang, P. Cai, and M. Liu, “SNE-RoadSeg: Incorporating surface normal information into semantic segmentation for accurate freespace detection,” in European Conference on Computer Vision. Springer, 2020, pp. 340–356.
  40. [40] H. Ye, J. Mei, and Y. Hu, “M2F2-Net: Multi-modal feature fusion for unstructured off-road freespace detection,” in 2023 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2023, pp. 1–7.
  41. [41] Z. Teed and J. Deng, “RAFT: Recurrent all-pairs field transforms for optical flow,” in European Conference on Computer Vision. Springer, 2020, pp. 402–419.
  42. [42] Z. Huang, X. Shi, C. Zhang, Q. Wang, K. C. Cheung, H. Qin, J. Dai, and H. Li, “FlowFormer: A transformer architecture for optical flow,” in European Conference on Computer Vision. Springer, 2022, pp. 668–685.
  43. [43] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson et al., “SAM 2: Segment anything in images and videos,” arXiv preprint arXiv:2408.00714, 2024.
  44. [44] S. Ding, R. Qian, X. Dong, P. Zhang, Y. Zang, Y. Cao, Y. Guo, D. Lin, and J. Wang, “SAM2Long: Enhancing SAM 2 for long video segmentation with a training-free memory tree,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 13614–13624.
  45. [45] C. Cuttano, G. Trivigno, G. Rosi, C. Masone, and G. Averta, “SAMWISE: Infusing wisdom in SAM2 for text-driven video segmentation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3395–3405.
  46. [46] X. Zhang, K. Fu, and Q. Zhao, “CamoSAM2: Motion-appearance induced auto-refining prompts for video camouflaged object detection,” arXiv preprint arXiv:2504.00375, 2025.
  47. [47] Z. Xu, J. Zhuang, Q. Liu, J. Zhou, and S. Peng, “Benchmarking a large-scale FIR dataset for on-road pedestrian detection,” Infrared Physics & Technology, vol. 96, pp. 199–208, 2019.
  48. [48] G. Franchi, M. Hariat, X. Yu, N. Belkhir, A. Manzanera, and D. Filliat, “InfraParis: A multi-modal and multi-task autonomous driving dataset,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 2973–2983.
  49. [49] Ray Vision Technologies, “Automotive night vision thermal camera – IR-Pilot series,” Webpage, Ray Vision Technologies, Pakistan, accessed: 2025-11-17. [Online]. Available: https://rayvisionpk.com/automotive-night-vision-thermal-camera/
  50. [50] Seeed Technology Co., Ltd., “Seeed Studio,” https://www.seeedstudio.com/, 2025, accessed: 2025-11-28.
  51. [51] W. Wang, “Advanced auto labeling solution with added features,” https://github.com/CVHub520/X-AnyLabeling, CVHub, 2023.
  52. [52] Y. Sun, W. Zuo, and M. Liu, “RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2576–2583, 2019.
  53. [53] L. Wellhausen, R. Ranftl, and M. Hutter, “Safe robot navigation via multi-modal anomaly detection,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1326–1333, 2020.
  54. [54] R. Schmid, D. Atha, F. Schöller, S. Dey, S. Fakoorian, K. Otsu, B. Ridge, M. Bjelonic, L. Wellhausen, M. Hutter et al., “Self-supervised traversability prediction by learning to reconstruct safe terrain,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 12419–12425.
  55. [55] J. Seo, S. Sim, and I. Shim, “Learning off-road terrain traversability with self-supervisions only,” IEEE Robotics and Automation Letters, vol. 8, no. 8, pp. 4617–4624, 2023.
  56. [56] P. Gao, T. Ma, H. Li, Z. Lin, J. Dai, and Y. Qiao, “ConvMAE: Masked convolution meets masked autoencoders,” arXiv preprint arXiv:2205.03892, 2022.
  57. [57] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa et al., “DINOv3,” arXiv preprint arXiv:2508.10104, 2025.
  58. [58] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.