pith. machine review for the scientific record.

arxiv: 2604.27499 · v1 · submitted 2026-04-30 · 💻 cs.CV

Recognition: unknown

Towards All-Day Perception for Off-Road Driving: A Large-Scale Multispectral Dataset and Comprehensive Benchmark

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 08:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords off-road driving · infrared dataset · freespace detection · temporal segmentation · memory attention · autonomous vehicles · multispectral perception

The pith

A memory-attention network trained on a new large infrared off-road dataset improves freespace detection accuracy by over 1% while running in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the IRON dataset, the first large-scale collection of 24,314 densely annotated infrared images paired with RGB for off-road freespace detection across day and night. It introduces IRONet, a flow-free temporal framework that aggregates historical context through a memory-attention mechanism to resolve inconsistencies between frames that plague single-frame methods. On the IRON benchmark this yields state-of-the-art IoU and F1 scores at real-time speeds. The same model also transfers directly to RGB images on existing off-road benchmarks, supporting more reliable all-day perception where visible light is unreliable.

Core claim

On the IRON dataset of 24,314 densely annotated infrared images with synchronized RGB, the IRONet model using memory attention and a mask decoder reaches 82.93% IoU and 90.66% F1 score for freespace detection, outperforming previous methods by 1.19% IoU and 0.71% F1 at real-time inference speeds. IRONet further shows strong generalization when applied to RGB images on the ORFD and Rellis datasets.
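To make the headline numbers concrete, here is a minimal sketch of how IoU and F1 are conventionally computed for binary freespace masks. The paper's evaluation code is not reproduced on this page, so the function below is illustrative, not the authors' implementation.

```python
import numpy as np

def freespace_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """IoU and F1 for binary freespace masks (True = drivable).

    Illustrative only: the paper's own evaluation script may differ in
    edge-case handling (empty masks, per-image vs. dataset-level averaging).
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()   # true positives
    fp = np.logical_and(pred, ~gt).sum()  # false positives
    fn = np.logical_and(~pred, gt).sum()  # false negatives
    iou = tp / (tp + fp + fn + 1e-9)
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return {"iou": iou, "f1": f1}
```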

What carries the argument

The memory-attention mechanism in IRONet that aggregates historical context from previous frames to enforce temporal consistency in freespace segmentation without optical flow.
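This page does not spell out the internals of IRONet's SGMC and HIMG modules, so the following is a generic sketch of the technique being described: current-frame feature tokens cross-attend over a FIFO bank of past-frame tokens, giving temporal aggregation with no optical-flow computation. Names, dimensions, and the eviction policy are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """Generic flow-free temporal aggregation: current-frame features
    attend over a FIFO bank of past-frame features. A sketch of the
    general technique only; IRONet's SGMC/HIMG modules are not
    specified on this page and will differ in detail."""

    def __init__(self, dim: int = 256, heads: int = 8, memory_size: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.memory_size = memory_size
        self.memory: list[torch.Tensor] = []  # past-frame feature tokens

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H*W, C) flattened per-frame feature tokens
        if self.memory:
            bank = torch.cat(self.memory, dim=1)   # (B, T*H*W, C)
            ctx, _ = self.attn(feats, bank, bank)  # query with current frame
            feats = self.norm(feats + ctx)         # residual fusion
        self.memory.append(feats.detach())         # push current frame
        if len(self.memory) > self.memory_size:
            self.memory.pop(0)                     # FIFO eviction
        return feats
```

Note the detach on stored features: without it, gradients would flow through the entire frame history, which is the usual reason such banks store detached tokens.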

If this is right

  • Temporal consistency can be added to single-frame perception models without the cost of optical-flow computation.
  • Infrared perception becomes practical for nighttime off-road autonomous driving.
  • The IRON dataset enables further development of multispectral methods for unstructured environments.
  • The memory-attention approach transfers across modalities without retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same temporal aggregation could be tested on longer sequences or fused with additional sensors to handle rapid terrain changes.
  • Methods tuned on this off-road infrared data may reveal weaknesses in models originally designed for structured on-road scenes.
  • Extending the framework to infrared object detection or depth estimation could produce similar consistency gains.

Load-bearing premise

That the reported accuracy gains arise chiefly from the memory-attention design rather than from dataset annotation quality, training choices, or the particular scenes used in the test split.

What would settle it

An independently collected infrared off-road dataset with different terrain and lighting where IRONet shows no improvement over single-frame baselines on the same metrics.

Figures

Figures reproduced from arXiv: 2604.27499 by Chen Min, Jilin Mei, Shuai Wang, Shuo Wang, Wenfei Guan, Yan Xing, Yu Hu.

Figure 1: Comparison of RGB and IR perception under nighttime …
Figure 2: Data Processing and Annotation Pipeline.
Figure 3: Samples from our IRON dataset. Each column shows a different scene, illustrating the diversity of environments and …
Figure 4: Overview of our proposed IRONet architecture. SGMC and HIMG represent the semantic-guided memory compensation …
Figure 5: Qualitative comparison of IRONet against state-of-the-art methods on a representative sequence from the IRON test set.
Original abstract

Off-road nighttime autonomous driving suffers from unreliable visible-light perception, making infrared modality crucial for accurate freespace detection. However, progress remains limited due to the scarcity of annotated infrared off-road datasets and the inter-frame inconsistencies inherent to current single-frame methods. To address these gaps, we present the IRON dataset, which, to our knowledge, is the first large-scale infrared dataset for off-road temporal freespace detection under all-day conditions, with strong support for nighttime perception. The dataset comprises 24,314 densely annotated infrared images with synchronized RGB images in diverse scenes and different light conditions. Building upon this dataset, we propose IRONet, a novel flow-free framework for temporal freespace detection that addresses inter-frame inconsistencies by aggregating historical context via a memory-attention mechanism and a carefully designed mask decoder. On our IRON dataset, IRONet achieves state-of-the-art performance, reaching 82.93%(+1.19%) IoU and 90.66%(+0.71%) F1 score at real-time inference. Remarkably, IRONet also exhibits robust generalization to RGB modalities on ORFD and Rellis datasets. Overall, our work establishes a foundation for reliable all-day off-road autonomous driving and future research in infrared temporal perception. The code and IRON dataset are available at https://github.com/wsnbws/IRON.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the IRON dataset (24,314 densely annotated infrared images paired with RGB, covering diverse off-road scenes and lighting conditions) and proposes IRONet, a flow-free temporal freespace detection network that aggregates historical context via a memory-attention mechanism and a mask decoder. It reports state-of-the-art results on IRON (82.93% IoU and 90.66% F1 at real-time inference) with claimed generalization to RGB on ORFD and Rellis datasets, positioning the work as a foundation for all-day off-road perception.

Significance. If the reported gains hold after proper controls, the work would provide a valuable large-scale infrared benchmark for off-road freespace detection and demonstrate a practical temporal architecture that improves frame-to-frame consistency without optical flow, advancing multispectral perception for autonomous driving in low-light conditions.

major comments (3)
  1. [Experiments] Experiments section: The headline claims of +1.19% IoU and +0.71% F1 over prior methods are presented without exhaustive ablations (e.g., memory-attention removed, single-frame baseline with identical backbone and training schedule, or simple frame-stacking comparator), error bars across runs, or explicit validation protocol details; this leaves open whether gains derive from the architecture or from dataset-specific factors.
  2. [Dataset] Dataset section: The IRON train/test split description does not report temporal non-overlap criteria, scene diversity statistics, or cross-validation to rule out memorization of off-road textures or lighting patterns, which is load-bearing for the generalization claims.
  3. [Experiments] Generalization experiments: The transfer results on ORFD and Rellis lack specification of the protocol (zero-shot inference vs. fine-tuning) and do not include a matched single-frame baseline, weakening the assertion that the memory-attention mechanism drives robust cross-modal performance.
minor comments (2)
  1. [Abstract] Abstract and method overview: The phrase 'flow-free framework' is used without a brief contrast to flow-based alternatives, which could be clarified for readers unfamiliar with the subfield.
  2. [Implementation] The GitHub link is provided, but the manuscript does not include a reproducibility checklist or hyperparameter table, which would aid verification of the real-time claims.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below and have made corresponding revisions to the paper.

Point-by-point responses
  1. Referee: Experiments section: The headline claims of +1.19% IoU and +0.71% F1 over prior methods are presented without exhaustive ablations (e.g., memory-attention removed, single-frame baseline with identical backbone and training schedule, or simple frame-stacking comparator), error bars across runs, or explicit validation protocol details; this leaves open whether gains derive from the architecture or from dataset-specific factors.

    Authors: We agree that the original manuscript would be strengthened by these additional controls. In the revised version, we have added an ablation study that removes the memory-attention module, a single-frame baseline using the identical backbone and training schedule, and a simple frame-stacking comparator. We now report error bars computed over three independent runs with different random seeds and provide explicit details on the train/validation/test protocol and hyperparameter settings in the Experiments section. These new results confirm that the reported gains are attributable to the proposed architecture. revision: yes

  2. Referee: Dataset section: The IRON train/test split description does not report temporal non-overlap criteria, scene diversity statistics, or cross-validation to rule out memorization of off-road textures or lighting patterns, which is load-bearing for the generalization claims.

    Authors: We acknowledge the need for greater transparency on the split. The revised Dataset section now explicitly describes the temporal non-overlap criteria (no shared frames or consecutive sequences between train and test), provides scene diversity statistics (number of distinct locations, distribution across daytime/nighttime and weather conditions), and includes a cross-validation experiment that demonstrates consistent performance across different scene partitions, thereby addressing concerns about memorization. revision: yes

  3. Referee: Generalization experiments: The transfer results on ORFD and Rellis lack specification of the protocol (zero-shot inference vs. fine-tuning) and do not include a matched single-frame baseline, weakening the assertion that the memory-attention mechanism drives robust cross-modal performance.

    Authors: We thank the referee for highlighting this omission. The revised manuscript now clearly states that the ORFD and Rellis results were obtained via zero-shot inference with no fine-tuning on the target datasets. We have also added a matched single-frame baseline (same backbone, trained only on IRON) for direct comparison on both datasets, which isolates the contribution of the memory-attention mechanism to the observed cross-modal generalization. revision: yes
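On the temporal non-overlap criteria raised in major comment 2 and response 2: a minimal sketch of the split discipline being described, which assigns whole recording sequences rather than individual frames to train or test. The (sequence_id, frame_path) layout is hypothetical, not the IRON release format.

```python
import random
from collections import defaultdict

def sequence_level_split(frames: list[tuple[str, str]],
                         test_frac: float = 0.2,
                         seed: int = 0) -> tuple[list[str], list[str]]:
    """Split by whole recording sequences so no sequence contributes
    frames to both train and test. `frames` holds (sequence_id,
    frame_path) pairs -- a hypothetical layout for illustration."""
    by_seq = defaultdict(list)
    for seq_id, path in frames:
        by_seq[seq_id].append(path)
    seq_ids = sorted(by_seq)
    random.Random(seed).shuffle(seq_ids)           # reproducible shuffle
    n_test = max(1, int(len(seq_ids) * test_frac))
    test = [p for s in seq_ids[:n_test] for p in by_seq[s]]
    train = [p for s in seq_ids[n_test:] for p in by_seq[s]]
    return train, test
```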

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark claims

full rationale

This is an empirical ML paper introducing the IRON dataset and evaluating IRONet on held-out test splits plus transfer to ORFD and Rellis. Reported IoU/F1 metrics are direct measurements from training and inference on those splits, not quantities that reduce by construction to fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are presented that loop the performance claims back to the inputs; the central results remain independent empirical observations.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The central claims rest on the accuracy of manual annotations in the new dataset, the assumption that temporal context via attention improves consistency, and standard deep-learning training assumptions. No machine-checked proofs or parameter-free derivations are present.

free parameters (1)
  • model hyperparameters and training schedule
    Standard neural network parameters tuned during development; not enumerated in abstract.
axioms (2)
  • domain assumption Dense manual annotations for freespace are accurate and consistent across frames and lighting conditions
    Required for supervised training and evaluation of the detection task.
  • domain assumption Aggregating historical context via memory attention reduces inter-frame inconsistencies better than single-frame or flow-based alternatives
    Core motivation and design choice for IRONet.
invented entities (2)
  • IRONet architecture no independent evidence
    purpose: Flow-free temporal freespace detection
    New model proposed to address the identified gaps.
  • memory-attention mechanism no independent evidence
    purpose: Aggregating historical context without optical flow
    Key component claimed to solve inter-frame inconsistency.

pith-pipeline@v0.9.0 · 5561 in / 1543 out tokens · 147410 ms · 2026-05-07T08:47:22.586818+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1] U. Shin, J. Park, and I. S. Kweon, “Deep depth estimation from thermal image,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1043–1053.
  2. [2] T. Kim, S. Shin, Y. Yu, H. G. Kim, and Y. M. Ro, “Causal mode multiplexer: A novel framework for unbiased multispectral pedestrian detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26784–26793.
  3. [3] J. Jang, C. Park, H. Kim, J. Lee, and J. Paik, “Multispectral object detection enhanced by cross-modal information complementary and cosine similarity channel resampling modules,” in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 9437–9446.
  4. [4] Y. Huang, T. Miyazaki, X. Liu, and S. Omachi, “Infrared image super-resolution: A systematic review and future trends,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025.
  5. [5] M. Yuan, B. Cui, T. Zhao, J. Wang, S. Fu, X. Yang, and X. Wei, “UniRGB-IR: A unified framework for visible-infrared semantic tasks via adapter tuning,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 2409–2418.
  6. [6] Q. Li, K. Tan, D. Yuan, and Q. Liu, “Progressive domain adaptation for thermal infrared tracking,” Electronics, vol. 14, no. 1, p. 162, 2025.
  7. [7] P. Jiang, P. Osteen, M. Wigness, and S. Saripalli, “RELLIS-3D dataset: Data, benchmarks and analysis,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 1110–1116.
  8. [8] C. Min, W. Jiang, D. Zhao, J. Xu, L. Xiao, Y. Nie, and B. Dai, “ORFD: A dataset and benchmark for off-road freespace detection,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 2532–2538.
  9. [9] P. Mortimer, R. Hagmanns, M. Granero, T. Luettel, J. Petereit, and H.-J. Wuensche, “The GOOSE dataset for perception in unstructured environments,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 14838–14844.
  10. [10] S. Sharma, L. Dabbiru, T. Hannis, G. Mason, D. W. Carruth, M. Doude, C. Goodin, C. Hudson, S. Ozier, J. E. Ball et al., “CaT: CAVS traversability dataset for off-road autonomous driving,” IEEE Access, vol. 10, pp. 24759–24768, 2022.
  11. [11] A. Datar, A. Pokhrel, M. Nazeri, M. B. Rao, C. Pan, Y. Zhang, A. Harrison, M. Wigness, P. R. Osteen, J. Ye et al., “M2P2: A multi-modal passive perception dataset for off-road mobility in extreme low-light conditions,” arXiv preprint arXiv:2410.01105, 2024.
  12. [12] J. Zhuang, Z. Wang, and J. Li, “Video semantic segmentation with inter-frame feature fusion and inner-frame feature refinement,” arXiv preprint arXiv:2301.03832, 2023.
  13. [13] S. A. S. Hesham, Y. Liu, G. Sun, H. Ding, J. Yang, E. Konukoglu, X. Geng, and X. Jiang, “Exploiting temporal state space sharing for video semantic segmentation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 24211–24221.
  14. [14] H. Wang, R. Fan, P. Cai, and M. Liu, “SNE-RoadSeg+: Rethinking depth-normal translation and deep supervision for freespace detection,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 1140–1145.
  15. [15] H. Li, Y. Chen, Q. Zhang, and D. Zhao, “BiFNet: Bidirectional fusion network for road segmentation,” IEEE Transactions on Cybernetics, vol. 52, no. 9, pp. 8617–8628, 2021.
  16. [16] J. Li, Y. Zhang, P. Yun, G. Zhou, Q. Chen, and R. Fan, “RoadFormer: Duplex transformer for RGB-normal semantic road scene parsing,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 7, pp. 5163–5172, 2024.
  17. [17] T. Sun, H. Ye, J. Mei, L. Chen, F. Zhao, L. Zong, and Y. Hu, “ROD: RGB-only fast and efficient off-road freespace detection,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 9787–9793.
  18. [18] Y. Weng, M. Han, H. He, M. Li, L. Yao, X. Chang, and B. Zhuang, “Mask propagation for efficient video semantic segmentation,” Advances in Neural Information Processing Systems, vol. 36, pp. 7170–7183, 2023.
  19. [19] V. Fedynyak, Y. Romanus, O. Dobosevych, I. Babin, and R. Riazantsev, “Global motion understanding in large-scale video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3153–3162.
  20. [20] Y. Liu, J. Yan, F. Jia, S. Li, A. Gao, T. Wang, and X. Zhang, “PETRv2: A unified framework for 3D perception from multi-camera images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3262–3272.
  21. [21] B. Xu, R. Hou, T. Ren, and G. Wu, “RGB-D video object segmentation via enhanced multi-store feature memory,” in Proceedings of the 2024 International Conference on Multimedia Retrieval, 2024, pp. 1016–1024.
  22. [22] J.-H. Baek, J. Oh, and Y. J. Koh, “Evolve: Event-guided deformable feature transfer and dual-memory refinement for low-light video object segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 11273–11282.
  23. [23] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
  24. [24] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
  25. [25] M. Wigness, S. Eum, J. G. Rogers, D. Han, and H. Kwon, “A RUGD dataset for autonomous navigation and visual perception in unstructured outdoor environments,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 5000–5007.
  26. [26] M. Sivaprakasam, P. Maheshwari, M. G. Castro, S. Triest, M. Nye, S. Willits, A. Saba, W. Wang, and S. Scherer, “TartanDrive 2.0: More modalities and better infrastructure to further self-supervised learning research in off-road driving tasks,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 12606–12606.
  27. [27] C. Min, J. Mei, H. Zhai, S. Wang, T. Sun, F. Kong, H. Li, F. Mao, F. Liu, S. Wang et al., “Advancing off-road autonomous driving: The large-scale ORAD-3D dataset and comprehensive benchmarks,” arXiv preprint arXiv:2510.16500, 2025.
  28. [28] K. Małek, J. Dybała, A. Kordecki, P. Hondra, and K. Kijania, “OffRoadSynth open dataset for semantic segmentation using synthetic-data-based weight initialization for autonomous UGV in off-road environments,” Journal of Intelligent & Robotic Systems, vol. 110, no. 2, p. 76, 2024.
  29. [29] Y. Choi, N. Kim, S. Hwang, K. Park, J. S. Yoon, K. An, and I. S. Kweon, “KAIST multi-spectral day/night data set for autonomous and assisted driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 3, pp. 934–948, 2018.
  30. [30] X. Jia, C. Zhu, M. Li, W. Tang, and W. Zhou, “LLVIP: A visible-infrared paired dataset for low-light vision,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3496–3504.
  31. [31] Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, and T. Harada, “MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 5108–5115.
  32. [32] FLIR Systems, Inc., “FLIR ADAS dataset,” Dataset, FLIR Systems, Inc., USA, accessed: 2025-10-27. [Online]. Available: https://oem.flir.com/en-in/solutions/automotive/adas-dataset-form/
  33. [33] J. Liu, X. Fan, Z. Huang, G. Wu, R. Liu, W. Zhong, and Z. Luo, “Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5802–5811.
  34. [34] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
  35. [35] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
  36. [36] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
  37. [37] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1290–1299.
  38. [38] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
  39. [39] R. Fan, H. Wang, P. Cai, and M. Liu, “SNE-RoadSeg: Incorporating surface normal information into semantic segmentation for accurate freespace detection,” in European Conference on Computer Vision. Springer, 2020, pp. 340–356.
  40. [40] H. Ye, J. Mei, and Y. Hu, “M2F2-Net: Multi-modal feature fusion for unstructured off-road freespace detection,” in 2023 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2023, pp. 1–7.
  41. [41] Z. Teed and J. Deng, “RAFT: Recurrent all-pairs field transforms for optical flow,” in European Conference on Computer Vision. Springer, 2020, pp. 402–419.
  42. [42] Z. Huang, X. Shi, C. Zhang, Q. Wang, K. C. Cheung, H. Qin, J. Dai, and H. Li, “FlowFormer: A transformer architecture for optical flow,” in European Conference on Computer Vision. Springer, 2022, pp. 668–685.
  43. [43] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson et al., “SAM 2: Segment anything in images and videos,” arXiv preprint arXiv:2408.00714, 2024.
  44. [44] S. Ding, R. Qian, X. Dong, P. Zhang, Y. Zang, Y. Cao, Y. Guo, D. Lin, and J. Wang, “SAM2Long: Enhancing SAM 2 for long video segmentation with a training-free memory tree,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 13614–13624.
  45. [45] C. Cuttano, G. Trivigno, G. Rosi, C. Masone, and G. Averta, “SAMWISE: Infusing wisdom in SAM2 for text-driven video segmentation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3395–3405.
  46. [46] X. Zhang, K. Fu, and Q. Zhao, “CamoSAM2: Motion-appearance induced auto-refining prompts for video camouflaged object detection,” arXiv preprint arXiv:2504.00375, 2025.
  47. [47] Z. Xu, J. Zhuang, Q. Liu, J. Zhou, and S. Peng, “Benchmarking a large-scale FIR dataset for on-road pedestrian detection,” Infrared Physics & Technology, vol. 96, pp. 199–208, 2019.
  48. [48] G. Franchi, M. Hariat, X. Yu, N. Belkhir, A. Manzanera, and D. Filliat, “InfraParis: A multi-modal and multi-task autonomous driving dataset,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 2973–2983.
  49. [49] Ray Vision Technologies, “Automotive night vision thermal camera – IR-Pilot series,” Webpage, Ray Vision Technologies, Pakistan, accessed: 2025-11-17. [Online]. Available: https://rayvisionpk.com/automotive-night-vision-thermal-camera/
  50. [50] Seeed Technology Co., Ltd., “Seeed Studio,” https://www.seeedstudio.com/, 2025, accessed: 2025-11-28.
  51. [51] W. Wang, “Advanced auto labeling solution with added features,” https://github.com/CVHub520/X-AnyLabeling, CVHub, 2023.
  52. [52] Y. Sun, W. Zuo, and M. Liu, “RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2576–2583, 2019.
  53. [53] L. Wellhausen, R. Ranftl, and M. Hutter, “Safe robot navigation via multi-modal anomaly detection,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1326–1333, 2020.
  54. [54] R. Schmid, D. Atha, F. Schöller, S. Dey, S. Fakoorian, K. Otsu, B. Ridge, M. Bjelonic, L. Wellhausen, M. Hutter et al., “Self-supervised traversability prediction by learning to reconstruct safe terrain,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 12419–12425.
  55. [55] J. Seo, S. Sim, and I. Shim, “Learning off-road terrain traversability with self-supervisions only,” IEEE Robotics and Automation Letters, vol. 8, no. 8, pp. 4617–4624, 2023.
  56. [56] P. Gao, T. Ma, H. Li, Z. Lin, J. Dai, and Y. Qiao, “ConvMAE: Masked convolution meets masked autoencoders,” arXiv preprint arXiv:2205.03892, 2022.
  57. [57] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa et al., “DINOv3,” arXiv preprint arXiv:2508.10104, 2025.
  58. [58] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.