Graph-based Semantic Calibration Network for Unaligned UAV RGBT Image Semantic Segmentation and A Large-scale Benchmark
Pith reviewed 2026-05-07 11:32 UTC · model grok-4.3
The pith
Decoupling RGB and thermal features, combined with graph priors on object relations, corrects spatial misalignment and semantic confusion in UAV aerial segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GSCNet decouples each input modality into shared structural and private perceptual components, performs deformable alignment inside the shared subspace, and feeds the result into a Semantic Graph Calibration Module that encodes hierarchical taxonomy and co-occurrence regularities as a category graph. Graph-attention reasoning then calibrates the final per-pixel predictions, producing measurable gains over prior methods, especially on fine-grained categories, when tested on the URTF collection of over 25,000 realistically misaligned RGB-thermal pairs across 61 classes.
What carries the argument
The Semantic Graph Calibration Module (SGCM), which converts hierarchical taxonomy and co-occurrence statistics into a structured category graph and applies graph-attention reasoning to adjust predictions for visually similar or rare ground-object classes.
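The calibration pattern described above can be sketched minimally in numpy. This is an illustrative stand-in, not the paper's SGCM: the class embeddings, the blend weight `alpha`, and the single propagation step are all assumptions. A prior adjacency matrix (from taxonomy or co-occurrence) masks attention between class embeddings, and the resulting attention map redistributes evidence among related classes before blending back into the raw logits.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_calibrate(logits, class_emb, adj, alpha=0.5):
    """One graph-attention-style calibration step over class logits.

    logits:    (N, C) raw per-pixel class scores
    class_emb: (C, d) class embeddings (hypothetical; learned in practice)
    adj:       (C, C) 0/1 prior adjacency from taxonomy/co-occurrence
    alpha:     blend weight between raw and graph-propagated logits
    """
    # attention scores between classes, masked to prior-linked pairs only
    scores = class_emb @ class_emb.T / np.sqrt(class_emb.shape[1])
    scores = np.where(adj > 0, scores, -1e9)
    attn = softmax(scores, axis=-1)          # (C, C), row-stochastic
    # propagate evidence between related classes, then blend with raw logits
    return (1 - alpha) * logits + alpha * logits @ attn.T

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 5))            # 16 pixels, 5 classes
emb = rng.normal(size=(5, 8))
adj = np.eye(5) + np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)  # chain taxonomy
cal = graph_calibrate(logits, emb, adj)
```

With `alpha=0` the calibration is the identity, so the prior can only reweight, never replace, the per-pixel evidence.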
If this is right
- The network produces higher accuracy on fine-grained ground-object categories that are easily confused in top-down views.
- Deformable alignment inside the shared structural subspace reduces the impact of appearance differences between RGB and thermal images.
- Explicit encoding of category hierarchy and co-occurrence priors lowers error rates for both rare and visually similar classes.
- The released URTF dataset of 25,000+ pairs supplies a standardized testbed for future unaligned RGBT segmentation methods.
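The decouple-then-align step referenced in the second bullet can be sketched minimally. This is a toy illustration under stated assumptions: the linear projections stand in for learned decoupling heads, and an integer shift stands in for the deformable alignment field; the paper's actual FDAM is not reproduced here.

```python
import numpy as np

def decouple(feat, w_shared, w_private):
    """Split an (H, W, C) feature map into shared and private components
    via two linear projections (stand-ins for learned heads)."""
    return feat @ w_shared, feat @ w_private

def shift_align(shared_t, offset):
    """Toy stand-in for deformable alignment: warp the thermal shared
    features by an integer (dy, dx) offset toward the RGB frame."""
    dy, dx = offset
    return np.roll(np.roll(shared_t, dy, axis=0), dx, axis=1)

rng = np.random.default_rng(1)
H, W, C, Cs = 8, 8, 16, 8
rgb = rng.normal(size=(H, W, C))
# simulate a misaligned thermal map: same content, shifted by (1, 2)
thermal = np.roll(np.roll(rgb, 1, axis=0), 2, axis=1)
w_s, w_p = rng.normal(size=(C, Cs)), rng.normal(size=(C, Cs))

shared_rgb, _ = decouple(rgb, w_s, w_p)
shared_th, _ = decouple(thermal, w_s, w_p)
aligned = shift_align(shared_th, offset=(-1, -2))  # undo the simulated shift
```

Because alignment happens in the shared subspace, modality-specific (private) appearance never enters the warp, which is the stated rationale for reduced appearance interference.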
Where Pith is reading between the lines
- The same decoupling-plus-graph pattern could be tested on other top-down multi-modal data such as satellite or autonomous-vehicle imagery where alignment is imperfect.
- Separate ablation of the graph module on additional datasets would clarify whether the semantic priors generalize beyond the current benchmark.
- The approach may extend to tasks that need both geometric correction and relational reasoning, such as multi-sensor object detection.
Load-bearing premise
That separating modalities into shared structural and private perceptual parts and then reasoning over an explicit category graph will correct both spatial misalignment and semantic confusion without creating new errors or overfitting to the benchmark.
What would settle it
Ablation results on the URTF benchmark in which removing the graph-attention module eliminates the reported gains on fine-grained and rare categories.
original abstract
Fine-grained RGBT image semantic segmentation is crucial for all-weather unmanned aerial vehicle (UAV) scene understanding. However, UAV RGBT image semantic segmentation faces two coupled challenges: cross-modal spatial misalignment caused by sensor parallax and platform vibration, and severe semantic confusion among fine-grained ground objects under top-down aerial views. To address these issues, we propose a Graph-based Semantic Calibration Network (GSCNet) for unaligned UAV RGBT image semantic segmentation. Specifically, we design a Feature Decoupling and Alignment Module (FDAM) that decouples each modality into shared structural and private perceptual components and performs deformable alignment in the shared subspace, enabling robust spatial correction with reduced modality appearance interference. Moreover, we propose a Semantic Graph Calibration Module (SGCM) that explicitly encodes the hierarchical taxonomy and co-occurrence regularities among ground-object categories in UAV scenes into a structured category graph, and incorporates these priors into graph-attention reasoning to calibrate predictions of visually similar and rare categories. In addition, we construct the Unaligned RGB-Thermal Fine-grained (URTF) benchmark, to the best of our knowledge, the largest and most fine-grained benchmark for unaligned UAV RGBT image semantic segmentation, containing over 25,000 image pairs across 61 semantic categories with realistic cross-modal misalignment. Extensive experiments on URTF demonstrate that GSCNet significantly outperforms state-of-the-art methods, with notable gains on fine-grained categories. The dataset is available at https://github.com/mmic-lcl/Datasets-and-benchmark-code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GSCNet for unaligned UAV RGBT semantic segmentation, consisting of FDAM (which decouples modalities into shared structural and private perceptual features then applies deformable alignment in the shared subspace) and SGCM (which builds a category graph encoding hierarchical taxonomy and co-occurrence priors and uses graph attention to calibrate predictions). It also introduces the URTF benchmark containing >25,000 image pairs across 61 fine-grained categories with realistic cross-modal misalignment, and reports that GSCNet significantly outperforms prior methods on this benchmark, especially on fine-grained classes.
Significance. If the empirical claims hold, the work directly targets two practical failure modes in UAV RGBT segmentation (spatial misalignment from parallax/vibration and semantic confusion among visually similar ground objects) via explicit modality decoupling and domain-specific priors. The release of a large-scale, fine-grained benchmark with realistic misalignment is a clear contribution that can support future research; the graph-based injection of taxonomy and co-occurrence knowledge is a principled way to regularize predictions without purely data-driven fitting.
major comments (2)
- §4 (Experiments): the central claim that GSCNet 'significantly outperforms' SOTA methods with 'notable gains on fine-grained categories' requires the full quantitative tables, per-class IoU breakdowns, and misalignment-severity stratification to be load-bearing; without them the outperformance cannot be verified as robust rather than benchmark-specific.
- §3.2 (SGCM): the graph-attention formulation must be shown to be non-circular (i.e., the taxonomy/co-occurrence priors are derived from external sources and not fitted on the test split of URTF); otherwise the calibration benefit on rare classes risks being an artifact of the benchmark construction.
minor comments (3)
- Abstract and §1: the description of FDAM states 'deformable alignment in the shared subspace' but does not specify the deformation field parameterization or the loss used to supervise alignment; a short equation or diagram reference would improve clarity.
- §2 (Related Work): the positioning against prior RGBT alignment methods (e.g., those using optical flow or attention) would benefit from a direct comparison table of misalignment-handling strategies.
- Dataset release: the GitHub link is provided, but the paper should include a brief description of the misalignment-generation protocol (sensor baseline, vibration simulation parameters) so that the benchmark can be reproduced or extended.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and constructive comments. We address each major comment point by point below, with revisions incorporated where appropriate to strengthen the manuscript.
point-by-point responses
-
Referee: §4 (Experiments): the central claim that GSCNet 'significantly outperforms' SOTA methods with 'notable gains on fine-grained categories' requires the full quantitative tables, per-class IoU breakdowns, and misalignment-severity stratification to be load-bearing; without them the outperformance cannot be verified as robust rather than benchmark-specific.
Authors: We agree that comprehensive breakdowns are necessary to substantiate the claims. The submitted manuscript included overall mIoU results and selected per-class metrics in Table 2, but to address this directly we have added the complete per-class IoU table for all 61 categories to the revised Section 4.2 (now Table 3) and introduced a new misalignment-severity stratification analysis in Section 4.4. Performance is now reported separately for low-, medium-, and high-misalignment subsets (defined by measured parallax and vibration offsets), confirming consistent gains on fine-grained classes across severity levels. revision: yes
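The severity-stratified evaluation the authors describe can be sketched as follows. The bin edges, offset measure, and data are hypothetical; this is not the paper's evaluation code, only the shape of the analysis: per-class IoU computed per image, then averaged within misalignment-severity bins.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class IoU between integer label maps pred and gt."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious[c] = inter / union
    return ious

def stratified_miou(preds, gts, offsets, num_classes, bins=(0, 2, 5, np.inf)):
    """Mean IoU per misalignment-severity bin (offset magnitude in pixels)."""
    out = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        idx = [i for i, o in enumerate(offsets) if lo <= o < hi]
        if not idx:
            continue
        ious = [per_class_iou(preds[i], gts[i], num_classes) for i in idx]
        out[(lo, hi)] = float(np.nanmean(ious))
    return out

rng = np.random.default_rng(2)
gts = [rng.integers(0, 4, size=(16, 16)) for _ in range(6)]
preds = [g.copy() for g in gts]          # perfect predictions for the demo
offsets = [0.5, 1.0, 3.0, 3.5, 6.0, 7.0]
result = stratified_miou(preds, gts, offsets, num_classes=4)
```

Reporting gains per bin is what separates genuine misalignment robustness from an average that is dominated by the easy, well-aligned pairs.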
-
Referee: §3.2 (SGCM): the graph-attention formulation must be shown to be non-circular (i.e., the taxonomy/co-occurrence priors are derived from external sources and not fitted on the test split of URTF); otherwise the calibration benefit on rare classes risks being an artifact of the benchmark construction.
Authors: We concur that non-circularity must be explicitly verified. The hierarchical taxonomy is constructed from external, publicly available sources including WordNet-derived aerial object hierarchies and domain literature on UAV scene semantics, with no dependence on URTF. Co-occurrence priors are pre-computed from an independent corpus of 120,000 aerial images drawn from public datasets (e.g., DOTA, iSAID) that share no overlap with URTF's training or test splits. We have revised Section 3.2 to include a new subsection detailing this construction pipeline and added an explicit statement confirming zero leakage from the URTF test set. revision: yes
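Pre-computing co-occurrence priors from an external label corpus, as the rebuttal describes, can be sketched as below. The corpus, the image-level notion of co-occurrence, and the row normalization are illustrative assumptions, not the paper's construction pipeline.

```python
import numpy as np

def cooccurrence_prior(label_maps, num_classes):
    """Row-normalized class co-occurrence matrix from a corpus of
    segmentation label maps: counts[a, b] = number of images in which
    classes a and b both appear."""
    counts = np.zeros((num_classes, num_classes))
    for lm in label_maps:
        present = np.unique(lm)
        for a in present:
            for b in present:
                counts[a, b] += 1
    row = counts.sum(axis=1, keepdims=True)
    # normalize rows, leaving never-seen classes as all-zero rows
    return np.divide(counts, row, out=np.zeros_like(counts), where=row > 0)

# tiny stand-in corpus of 2x2 label maps over 3 classes
maps = [np.array([[0, 1], [1, 0]]),
        np.array([[0, 2], [2, 2]]),
        np.array([[1, 2], [1, 1]])]
prior = cooccurrence_prior(maps, num_classes=3)
```

Because the matrix is computed once from an external corpus and frozen, it cannot leak test-split statistics, which is exactly the non-circularity property the referee asked to be verified.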
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper introduces a novel two-module architecture (FDAM for modality decoupling and deformable alignment in shared subspace; SGCM for graph-attention encoding of taxonomy and co-occurrence priors) to address specific UAV RGBT challenges, plus a new URTF benchmark with 25k+ pairs and 61 categories. All central claims are empirical outperformance results on this held-out benchmark. No load-bearing equations, parameters, or premises reduce by construction to fitted inputs, self-definitions, or self-citation chains. The approach is presented as an independent design choice validated experimentally rather than derived tautologically from its own data or prior author results.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Cross-modal spatial misalignment is caused by sensor parallax and platform vibration and can be mitigated by decoupling into shared structural and private perceptual components.
- domain assumption: Hierarchical taxonomy and co-occurrence regularities among UAV ground-object categories can be encoded into a structured category graph that improves graph-attention reasoning for visually similar or rare classes.