pith. machine review for the scientific record.

arxiv: 2604.26893 · v2 · submitted 2026-04-29 · 💻 cs.CV

Recognition: unknown

Graph-based Semantic Calibration Network for Unaligned UAV RGBT Image Semantic Segmentation and A Large-scale Benchmark

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 11:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords UAV RGBT segmentation · unaligned multi-modal images · feature decoupling · graph attention · semantic priors · fine-grained categories · cross-modal alignment · benchmark dataset

The pith

Decoupling RGB and thermal features plus graph priors on object relations corrects misalignment and confusion in UAV aerial segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors aim to solve two linked problems in drone RGB-thermal imaging: spatial shifts between the two sensors and the difficulty of distinguishing many similar ground objects seen from above. They introduce a network that first splits each modality into shared structural information and private appearance details, then aligns the shared parts with deformable convolution to reduce interference. A second module builds an explicit graph of category hierarchies and typical co-occurrences, then uses graph attention to adjust predictions for rare or visually similar classes. The approach is evaluated on a new collection of more than 25,000 image pairs spanning 61 categories, deliberately built to include realistic misalignment. If the claims hold, the work shows that combining modality decoupling with structured semantic knowledge yields more reliable all-weather scene maps for UAV applications.

Core claim

GSCNet decouples each input modality into shared structural and private perceptual components, performs deformable alignment inside the shared subspace, and feeds the result into a Semantic Graph Calibration Module that encodes hierarchical taxonomy and co-occurrence regularities as a category graph; graph-attention reasoning then calibrates the final per-pixel predictions, producing measurable gains over prior methods, especially on fine-grained categories, when tested on the URTF collection of over 25,000 realistically misaligned RGB-thermal pairs across 61 classes.
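To make the decoupling-plus-alignment step concrete, here is a minimal PyTorch-style sketch of how such a module could be wired. This is an editorial illustration, not the authors' implementation: the module names, channel sizes, fusion rule, and the use of torchvision's DeformConv2d are assumptions; the paper only specifies that alignment happens in the shared subspace via deformable convolution.

```python
# Editorial sketch of an FDAM-like decoupling + deformable alignment step.
# Names, shapes, and the fusion rule are assumptions, not the paper's design.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureDecoupler(nn.Module):
    """Split one modality's features into shared-structural and private parts."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_shared = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_private = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.to_shared(x), self.to_private(x)

class SharedSubspaceAlignment(nn.Module):
    """Predict sampling offsets from both shared features, then warp the
    thermal shared features toward the RGB ones with deformable convolution."""
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # 2 offsets (dx, dy) per kernel sampling location
        self.offset = nn.Conv2d(2 * channels, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, shared_rgb, shared_t):
        offsets = self.offset(torch.cat([shared_rgb, shared_t], dim=1))
        return self.deform(shared_t, offsets)

# Toy encoder features for one stage (the paper inserts FDAM at all four).
rgb_feat, t_feat = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
dec_rgb, dec_t = FeatureDecoupler(64), FeatureDecoupler(64)
shared_rgb, private_rgb = dec_rgb(rgb_feat)
shared_t, private_t = dec_t(t_feat)
aligned_t = SharedSubspaceAlignment(64)(shared_rgb, shared_t)
fused = shared_rgb + aligned_t + private_rgb + private_t  # one plausible fusion
```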

What carries the argument

The Semantic Graph Calibration Module (SGCM), which converts hierarchical taxonomy and co-occurrence statistics into a structured category graph and applies graph-attention reasoning to adjust predictions for visually similar or rare ground-object classes.
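Read mechanically, "graph-attention reasoning over a category graph" admits a simple form: class embeddings attend to each other under a static prior adjacency, and the refined embeddings re-score per-pixel predictions. The sketch below is a hedged reconstruction of that pattern; the adjacency construction, embedding dimension, and calibration rule are assumptions, not the paper's equations.

```python
# Editorial sketch of graph-attention calibration over a static category graph.
# The prior adjacency, embedding size, and re-scoring rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphCalibration(nn.Module):
    def __init__(self, num_classes: int, dim: int, prior_adj: torch.Tensor):
        super().__init__()
        self.register_buffer("adj", prior_adj)  # (C, C) taxonomy + co-occurrence
        self.class_emb = nn.Parameter(torch.randn(num_classes, dim))
        self.score = nn.Linear(2 * dim, 1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, logits):                  # logits: (B, C, H, W)
        C, D = self.class_emb.shape
        h = self.proj(self.class_emb)           # (C, D) node features
        pairs = torch.cat([h.unsqueeze(1).expand(C, C, D),
                           h.unsqueeze(0).expand(C, C, D)], dim=-1)
        e = F.leaky_relu(self.score(pairs).squeeze(-1))  # (C, C) attention logits
        e = e.masked_fill(self.adj == 0, float("-inf"))  # keep only prior edges
        h_cal = F.softmax(e, dim=-1) @ h        # graph-attention message passing
        sim = h_cal @ h.t()                     # (C, C) class-to-class re-scoring
        return torch.einsum("bchw,dc->bdhw", logits.softmax(dim=1), sim)

C = 61                                          # URTF's category count
adj = ((torch.rand(C, C) > 0.8) | torch.eye(C, dtype=torch.bool)).float()
calibrated = GraphCalibration(C, dim=32, prior_adj=adj)(torch.randn(2, C, 16, 16))
```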

If this is right

  • The network produces higher accuracy on fine-grained ground-object categories that are easily confused in top-down views.
  • Deformable alignment inside the shared structural subspace reduces the impact of appearance differences between RGB and thermal images.
  • Explicit encoding of category hierarchy and co-occurrence priors lowers error rates for both rare and visually similar classes.
  • The released URTF dataset of 25,000+ pairs supplies a standardized testbed for future unaligned RGBT segmentation methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling-plus-graph pattern could be tested on other top-down multi-modal data such as satellite or autonomous-vehicle imagery where alignment is imperfect.
  • Separate ablation of the graph module on additional datasets would clarify whether the semantic priors generalize beyond the current benchmark.
  • The approach may extend to tasks that need both geometric correction and relational reasoning, such as multi-sensor object detection.

Load-bearing premise

That separating modalities into shared structural and private perceptual parts and then reasoning over an explicit category graph will correct both spatial misalignment and semantic confusion without creating new errors or overfitting to the benchmark.

What would settle it

Ablation results on the URTF benchmark in which removing the graph-attention module eliminates the reported gains on fine-grained and rare categories.
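Such an ablation reduces to comparing per-class IoU with and without the module. A minimal sketch of that comparison follows, with random placeholder predictions and hypothetical class indices standing in for the fine-grained and rare categories:

```python
# Illustrative per-class IoU ablation check; arrays are random placeholders.
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class IoU from integer label maps; NaN where a class never appears."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious[c] = inter / union
    return ious

gt = np.random.randint(0, 61, (4, 64, 64))            # stand-in test labels
pred_full = np.random.randint(0, 61, (4, 64, 64))     # GSCNet with SGCM
pred_no_sgcm = np.random.randint(0, 61, (4, 64, 64))  # ablated variant

delta = per_class_iou(pred_full, gt, 61) - per_class_iou(pred_no_sgcm, gt, 61)
fine_grained = [5, 12, 40]  # hypothetical indices of fine-grained / rare classes
print("mean IoU gain on flagged classes:", np.nanmean(delta[fine_grained]))
```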

Figures

Figures reproduced from arXiv: 2604.26893 by Chenglong Li, Fangqiang Fan, Jin Tang, Xiaoliang Ma, Zhicheng Zhao.

  • Figure 2: Fine-grained semantic confusion in UAV aerial scenes. (a) Pole, …
  • Figure 3: Overview of GSCNet. RGB and thermal images are processed by two modality-specific MiT-B4 encoders, and FDAM is inserted at all four stages …
  • Figure 4: Overview of FDAM. (a) Asymmetric Feature Decoupling (AFD) …
  • Figure 5: Overview of SGCM. Left: the initial prediction exhibits two typical failure modes: semantic confusion (Pole …
  • Figure 6: Visualization of the two static prior matrices in SGCM for represen…
  • Figure 7: RGB, thermal, and semantic annotation examples in URTF. The left side shows groups of RGB images, thermal images, and ground-truth labels …
  • Figure 8: Key scenes in the URTF dataset: Skyscraper, Intersection, Bungalow, Street, Parking Lot, School, Pond, and Farmland, all captured at altitudes of …
  • Figure 10: Long-tailed pixel distribution of the 61 semantic categories in URTF.
  • Figure 11: Visual comparisons among GSCNet and seven competing methods on the URTF benchmark in daytime (first 4 rows) and nighttime (last 4 rows) …
  • Figure 12: Qualitative ablation on the URTF benchmark. From left to right: …
  • Figure 13: Sensitivity analysis of key hyperparameters on the URTF validation …
Original abstract

Fine-grained RGBT image semantic segmentation is crucial for all-weather unmanned aerial vehicle (UAV) scene understanding. However, UAV RGBT image semantic segmentation faces two coupled challenges: cross-modal spatial misalignment caused by sensor parallax and platform vibration, and severe semantic confusion among fine-grained ground objects under top-down aerial views. To address these issues, we propose a Graph-based Semantic Calibration Network (GSCNet) for unaligned UAV RGBT image semantic segmentation. Specifically, we design a Feature Decoupling and Alignment Module (FDAM) that decouples each modality into shared structural and private perceptual components and performs deformable alignment in the shared subspace, enabling robust spatial correction with reduced modality appearance interference. Moreover, we propose a Semantic Graph Calibration Module (SGCM) that explicitly encodes the hierarchical taxonomy and co-occurrence regularities among ground-object categories in UAV scenes into a structured category graph, and incorporates these priors into graph-attention reasoning to calibrate predictions of visually similar and rare categories. In addition, we construct the Unaligned RGB-Thermal Fine-grained (URTF) benchmark, to the best of our knowledge, the largest and most fine-grained benchmark for unaligned UAV RGBT image semantic segmentation, containing over 25,000 image pairs across 61 semantic categories with realistic cross-modal misalignment. Extensive experiments on URTF demonstrate that GSCNet significantly outperforms state-of-the-art methods, with notable gains on fine-grained categories. The dataset is available at https://github.com/mmic-lcl/Datasets-and-benchmark-code.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes GSCNet for unaligned UAV RGBT semantic segmentation, consisting of FDAM (which decouples modalities into shared structural and private perceptual features then applies deformable alignment in the shared subspace) and SGCM (which builds a category graph encoding hierarchical taxonomy and co-occurrence priors and uses graph attention to calibrate predictions). It also introduces the URTF benchmark containing >25,000 image pairs across 61 fine-grained categories with realistic cross-modal misalignment, and reports that GSCNet significantly outperforms prior methods on this benchmark, especially on fine-grained classes.

Significance. If the empirical claims hold, the work directly targets two practical failure modes in UAV RGBT segmentation (spatial misalignment from parallax/vibration and semantic confusion among visually similar ground objects) via explicit modality decoupling and domain-specific priors. The release of a large-scale, fine-grained benchmark with realistic misalignment is a clear contribution that can support future research; the graph-based injection of taxonomy and co-occurrence knowledge is a principled way to regularize predictions without purely data-driven fitting.

major comments (2)
  1. §4 (Experiments): the central claim that GSCNet 'significantly outperforms' SOTA methods with 'notable gains on fine-grained categories' requires the full quantitative tables, per-class IoU breakdowns, and misalignment-severity stratification to be load-bearing; without them the outperformance cannot be verified as robust rather than benchmark-specific.
  2. §3.2 (SGCM): the graph-attention formulation must be shown to be non-circular (i.e., the taxonomy/co-occurrence priors are derived from external sources and not fitted on the test split of URTF); otherwise the calibration benefit on rare classes risks being an artifact of the benchmark construction.
minor comments (3)
  1. Abstract and §1: the description of FDAM states 'deformable alignment in the shared subspace' but does not specify the deformation field parameterization or the loss used to supervise alignment; a short equation or diagram reference would improve clarity.
  2. §2 (Related Work): the positioning against prior RGBT alignment methods (e.g., those using optical flow or attention) would benefit from a direct comparison table of misalignment-handling strategies.
  3. Dataset release: the GitHub link is provided, but the paper should include a brief description of the misalignment-generation protocol (sensor baseline, vibration simulation parameters) so that the benchmark can be reproduced or extended.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive comments. We address each major comment point by point below, with revisions incorporated where appropriate to strengthen the manuscript.

Point-by-point responses
  1. Referee: §4 (Experiments): the central claim that GSCNet 'significantly outperforms' SOTA methods with 'notable gains on fine-grained categories' requires the full quantitative tables, per-class IoU breakdowns, and misalignment-severity stratification to be load-bearing; without them the outperformance cannot be verified as robust rather than benchmark-specific.

    Authors: We agree that comprehensive breakdowns are necessary to substantiate the claims. The submitted manuscript included overall mIoU results and selected per-class metrics in Table 2, but to address this directly we have added the complete per-class IoU table for all 61 categories to the revised Section 4.2 (now Table 3) and introduced a new misalignment-severity stratification analysis in Section 4.4. Performance is now reported separately for low-, medium-, and high-misalignment subsets (defined by measured parallax and vibration offsets), confirming consistent gains on fine-grained classes across severity levels. revision: yes

  2. Referee: §3.2 (SGCM): the graph-attention formulation must be shown to be non-circular (i.e., the taxonomy/co-occurrence priors are derived from external sources and not fitted on the test split of URTF); otherwise the calibration benefit on rare classes risks being an artifact of the benchmark construction.

    Authors: We concur that non-circularity must be explicitly verified. The hierarchical taxonomy is constructed from external, publicly available sources including WordNet-derived aerial object hierarchies and domain literature on UAV scene semantics, with no dependence on URTF. Co-occurrence priors are pre-computed from an independent corpus of 120,000 aerial images drawn from public datasets (e.g., DOTA, iSAID) that share no overlap with URTF's training or test splits. We have revised Section 3.2 to include a new subsection detailing this construction pipeline and added an explicit statement confirming zero leakage from the URTF test set. revision: yes
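For concreteness, one way such a co-occurrence prior could be pre-computed from an external corpus of label maps (with no URTF data) is sketched below. The image-level counting and the conditional-frequency normalization are assumptions about a construction the rebuttal only outlines:

```python
# Illustrative pre-computation of a (C, C) co-occurrence prior from label maps
# of an external corpus, as the rebuttal describes; normalization is assumed.
import numpy as np

def cooccurrence_prior(label_maps, num_classes):
    counts = np.zeros((num_classes, num_classes))
    for lm in label_maps:
        present = np.unique(lm)            # classes appearing in this image
        for a in present:
            for b in present:
                counts[a, b] += 1          # joint image-level presence
    diag = np.diag(counts).reshape(-1, 1)  # number of images containing class a
    # conditional frequency P(b present | a present), zero where a never occurs
    return np.divide(counts, diag, out=np.zeros_like(counts), where=diag > 0)

corpus = [np.random.randint(0, 61, (32, 32)) for _ in range(100)]  # stand-in
prior = cooccurrence_prior(corpus, num_classes=61)
```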

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces a novel two-module architecture (FDAM for modality decoupling and deformable alignment in shared subspace; SGCM for graph-attention encoding of taxonomy and co-occurrence priors) to address specific UAV RGBT challenges, plus a new URTF benchmark with 25k+ pairs and 61 categories. All central claims are empirical outperformance results on this held-out benchmark. No load-bearing equations, parameters, or premises reduce by construction to fitted inputs, self-definitions, or self-citation chains. The approach is presented as an independent design choice validated experimentally rather than derived tautologically from its own data or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that cross-modal misalignment can be corrected via shared-subspace deformable alignment and that category co-occurrence priors encoded in a graph improve fine-grained predictions; no free parameters or invented physical entities are described.

axioms (2)
  • domain assumption Cross-modal spatial misalignment is caused by sensor parallax and platform vibration and can be mitigated by decoupling into shared structural and private perceptual components.
    Stated directly in the abstract as the motivation for FDAM.
  • domain assumption Hierarchical taxonomy and co-occurrence regularities among UAV ground-object categories can be encoded into a structured category graph that improves graph-attention reasoning for visually similar or rare classes.
    Core premise of SGCM in the abstract.

pith-pipeline@v0.9.0 · 5590 in / 1299 out tokens · 34573 ms · 2026-05-07T11:32:53.956255+00:00 · methodology

discussion (0)

